Abstract

We consider Wasserstein distance functionals for assessing the convergence of latent discrete measures, which serve as mixing distributions in hierarchical and nonparametric mixture models. We clarify the relationships between Wasserstein distances of mixing distributions and f-divergence functionals such as Hellinger and Kullback-Leibler distances on the space of mixture distributions using various identifiability conditions. The convergence in Wasserstein metrics for discrete measures has a natural interpretation as the convergence of individual atoms that provide support for the discrete measure. It is also typically stronger than the weak convergence induced by standard f-divergence metrics. We establish rates of convergence of posterior distributions for latent discrete measures in several mixture models, including finite mixtures of multivariate distributions, finite mixtures of Gaussian processes and infinite mixtures based on the Dirichlet process.



1 Introduction 

A notable feature in the development of hierarchical and Bayesian nonparametric models is the role of discrete probability measures, which serve as mixing distributions to combine relatively simple models into richer classes of statistical models (Lindsay, 1995; McLachlan and Basford, 1988). In recent years the mixture modeling methodology has been significantly extended by many authors, who take the mixing measure to be random and infinite dimensional via suitable priors constructed in a nested, hierarchical and nonparametric manner. This results in rich models that can fit more complex and high dimensional data (see, e.g., Gelfand et al. (2005); Teh et al. (2006); Rodriguez et al. (2008); Petrone et al. (2009); Nguyen (2010) for several examples of such models, as well as a recent book, Hjort et al. (2010)).

The focus of this paper is to analyze the convergence behavior of the posterior distribution of latent mixing measures as they arise in several mixture models, including infinite mixture and other nonparametric models. Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ denote a discrete probability measure. The atoms $\theta_i$ are elements of a space $\Theta$, while the vector of probabilities $p = (p_1, \ldots, p_k)$ lies in a $(k-1)$-dimensional probability simplex. In a mixture setting, $G$ is combined with a likelihood density $f(\cdot|\theta)$ with respect to a dominating measure $\mu$ on $\mathcal{X}$, to yield the mixture density $p_G(x) = \int f(x|\theta)\,dG(\theta) = \sum_{i=1}^{k} p_i f(x|\theta_i)$. In a clustering application, the atoms $\theta_i$ represent distinct behaviors in a heterogeneous data population, while the mixing probabilities $p_i$ are the associated proportions of such behaviors. Under this interpretation, there is a need for comparing and assessing the quality of the discrete measure $G$ estimated on the basis of available data. An important work in this direction is by Chen (1995), who used the $L_1$ metric on the cumulative distribution functions on the real line to study convergence rates of the mixing measure $G$. Building upon Chen's work, Ishwaran, James and Sun (Ishwaran et al., 2001) established the posterior consistency of a finite dimensional Dirichlet prior for Bayesian mixture models. Their analysis is specific to univariate finite mixture models, with $k$ bounded by a known constant, while our interest is when $k$ may be unbounded and/or $\Theta$ has high or infinite dimensions. For instance, $\Theta$ may be a subset of a function space as in the work of Gelfand et al. (2005) and Nguyen (2010), or a space of probability measures (Rodriguez et al., 2008).

The analysis of consistency and convergence rates of posterior distributions for Bayesian estimation has seen much progress in the past decade. Key recent references include Barron et al. (1999); Ghosal et al. (2000); Shen and Wasserman (2001); Walker (2004); Ghosal and van der Vaart (2007a); Walker et al. (2007). Specific mixture models in a Bayesian setting have also been studied extensively (Ghosal et al. (1999); Genovese and Wasserman (2000); Ishwaran and Zarepour (2002); Ghosal and van der Vaart (2007b)). All these works primarily focus on convergence in the topology of the Hellinger or a comparable distance metric in the space of data densities $p_G$. On the other hand, there is far less work concerning the convergence behavior of latent mixing measures $G$. Notably, the analysis of convergence for mixing (smooth) densities often arose in the context of frequentist estimation for deconvolution problems, mainly with kernel density estimation methods (Zhang (1990); Fan (1991)).

The primary contribution of this paper is to show that the Wasserstein distances provide a natural and useful metric for the analysis of convergence for latent and discrete mixing measures in mixture models, and to establish convergence rates of posterior distributions in a number of well-known Bayesian nonparametric and mixture models. Wasserstein distances originally arose in the problem of optimal transportation (Villani, 2003). They have been utilized in a number of statistical contexts (e.g., Dudley (1976); Mallows (1972); Bickel and Freedman (1981); del Barrio et al. (1999)). For discrete probability measures, they can be obtained by a minimum matching (or moving) procedure between the sets of atoms that provide support for the measures under comparison, and consequently are simple to compute. Suppose that $\Theta$ is equipped with a metric $\rho$. Let $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$. Then, the $L_r$ Wasserstein distance metric on the space of discrete probability measures with support in $\Theta$, namely, $\mathcal{G}(\Theta)$, is:

$$d_\rho(G,G') = \Big[\inf_{q} \sum_{i,j} q_{ij}\,\rho^r(\theta_i,\theta'_j)\Big]^{1/r},$$

where the infimum is taken over all joint probability distributions $q$ on $[1,\ldots,k] \times [1,\ldots,k']$ such that $\sum_j q_{ij} = p_i$ and $\sum_i q_{ij} = p'_j$.

As is clear from this definition, the Wasserstein distances inherit directly the metric of the space of atomic support $\Theta$, suggesting that they can be useful for assessing estimation procedures for discrete measures in hierarchical models. It is worth noting that if $(G_n)_{n \ge 1}$ is a sequence of discrete probability measures with $k$ distinct atoms and $G_n$ tends to some $G_0$ in the $d_\rho$ metric, then $G_n$'s ordered set of atoms must converge to $G_0$'s atoms in $\rho$ after some permutation of atom labels. Thus, in the clustering application illustrated above, convergence of the mixing measure $G$ may be interpreted as the convergence of the distinct typical behaviors $\theta_i$ that characterize the heterogeneous data population. A hint for the relevance of the Wasserstein distances can be drawn from the observation that the $L_1$ distance for the CDFs of univariate random variables, as studied by Chen (1995), is in fact a special case of the $L_1$ Wasserstein metric when $\Theta = \mathbb{R}$.
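The special case $\Theta = \mathbb{R}$ can be checked numerically. The following is a minimal sketch, not taken from the paper: the helper names `wasserstein_1d` and `cdf_l1_distance` and the two example measures are illustrative assumptions. It relies on the standard fact that, on the real line, the monotone (sorted) coupling attains the infimum for the cost $|x-y|^r$ with $r \ge 1$; it does not handle a general metric space $\Theta$.

```python
def wasserstein_1d(atoms_a, probs_a, atoms_b, probs_b, r=1):
    """L_r Wasserstein distance between two discrete measures on the real line.

    Assumption: the monotone (sorted) coupling is used, which is optimal on
    the line for the cost |x - y|**r with r >= 1.
    """
    a = sorted(zip(atoms_a, probs_a))
    b = sorted(zip(atoms_b, probs_b))
    i = j = 0
    mass_a, mass_b = a[0][1], b[0][1]
    cost = 0.0
    while True:
        # Move as much mass as possible between the current pair of atoms.
        moved = min(mass_a, mass_b)
        cost += moved * abs(a[i][0] - b[j][0]) ** r
        mass_a -= moved
        mass_b -= moved
        if mass_a <= 1e-12:
            i += 1
            if i == len(a):
                break
            mass_a = a[i][1]
        if mass_b <= 1e-12:
            j += 1
            if j == len(b):
                break
            mass_b = b[j][1]
    return cost ** (1.0 / r)


def cdf_l1_distance(atoms_a, probs_a, atoms_b, probs_b):
    """L_1 distance between the two (piecewise constant) CDFs."""
    points = sorted(set(atoms_a) | set(atoms_b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = sum(p for x, p in zip(atoms_a, probs_a) if x <= left)
        cdf_b = sum(p for x, p in zip(atoms_b, probs_b) if x <= left)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Hypothetical example measures: G = 0.5*delta_0 + 0.5*delta_2,
# G' = 0.25*delta_1 + 0.75*delta_3.
w1 = wasserstein_1d([0.0, 2.0], [0.5, 0.5], [1.0, 3.0], [0.25, 0.75], r=1)
l1 = cdf_l1_distance([0.0, 2.0], [0.5, 0.5], [1.0, 3.0], [0.25, 0.75])
print(w1, l1)  # both values equal 1.5 for this example
```

For a general metric space the infimum over couplings is a linear program, which this sketch does not attempt to solve.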

The plan for the paper is as follows. Section 2 explores the relationship between Wasserstein distances for mixing measures and well-known divergence functionals for mixture densities in a mixture model. We produce a simple lemma which gives an upper bound of f-divergences between mixture densities by certain Wasserstein distances between mixing measures. This implies that the $d_\rho$ topology can be stronger than those induced by divergences between mixture densities. Next, we consider various identifiability conditions under which convergence of mixture densities entails convergence of mixing measures in the Wasserstein metric. We present two theorems, which provide upper bounds of $d_\rho(G, G')$ in terms of divergences between $p_G$ and $p_{G'}$. Theorem 1 is applicable to mixing measures with a bounded number of atomic support points, generalizing a result from Chen (1995). Theorem 2 is applicable to mixing measures with an unbounded number of support points, but is restricted to convolution mixture models.

Section 3 focuses on the convergence of posterior distributions of latent mixing measures in a Bayesian nonparametric setting. Here, the mixing measure $G$ is endowed with a prior distribution $\Pi$. Assuming an $n$-sample $X_1, \ldots, X_n$ that is generated according to $p_{G_0}$, we study conditions under which the posterior distribution of $G$, namely $\Pi(\cdot|X_1, \ldots, X_n)$, contracts to the "truth" $G_0$ under the $d_\rho$ metric, and provide the contraction rates. In Theorems 3 and 4 of Section 3, we establish the convergence rates for the posterior distribution of $G$ in terms of the $d_\rho$ metric. These results are proved using the standard approach of Ghosal, Ghosh and van der Vaart (Ghosal et al., 2000). Our convergence theorems have several notable features. They rely on separate conditions for the prior $\Pi$ and the likelihood function $f$, which are typically simpler to verify than conditions formulated in terms of mixture densities. The claim of convergence in Wasserstein metrics is typically stronger than the weak convergence induced by the Hellinger metric in the existing work mentioned above.

In Section 4, posterior consistency and convergence rates of latent mixing measures are derived for a number of well-known mixture models in the literature, including finite mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and finite mixtures of Gaussian processes. For finite mixtures with a bounded number of atomic support points, the posterior convergence rate for the mixing measure is the minimax optimal $n^{-1/4}$ under suitable identifiability conditions. For Dirichlet process mixtures defined on $\mathbb{R}^d$, specific rates are established under smoothness conditions on the likelihood density function $f$. In particular, for ordinary smooth likelihood densities with smoothness $\beta$ (e.g., Laplace), the rate achieved is $(\log n/n)^{\gamma}$ for any $\gamma < \frac{2}{(d+2)(4+(2\beta+1)d)}$. For supersmooth likelihood densities with smoothness $\beta$ (e.g., normal), the rate achieved is $(\log n)^{-1/\beta}$. Finally, for finite mixtures of Gaussian processes, we are also able to establish a convergence rate by utilizing a result on a (single) Gaussian process prior by van der Vaart and van Zanten (2008b).



Notations. For ease of notation, we also use $f_i$ in place of $f(\cdot|\theta_i)$, and $f'_j$ in place of $f(\cdot|\theta'_j)$ for likelihood density functions. Divergences studied in the paper include the total variational distance $d_V(p_G, p_{G'}) := \frac{1}{2}\int |p_G(x) - p_{G'}(x)|\,d\mu$, the Hellinger distance $d_h^2(p_G, p_{G'}) := \frac{1}{2}\int \big(\sqrt{p_G(x)} - \sqrt{p_{G'}(x)}\big)^2 d\mu$, and the Kullback-Leibler divergence $d_K(p_G, p_{G'}) := \int p_G(x)\log\big(p_G(x)/p_{G'}(x)\big)\,d\mu$. These divergences are related by $d_V^2/2 \le d_h^2 \le d_V$ and $d_h^2 \le d_K/2$. $N(\epsilon, \Theta, \rho)$ denotes the covering number of the metric space $(\Theta, \rho)$, i.e., the minimum number of $\epsilon$-balls needed to cover the entire space $\Theta$. $D(\epsilon, \Theta, \rho)$ denotes the packing number of $(\Theta, \rho)$, i.e., the maximum number of points that are mutually separated by at least $\epsilon$ in distance. They are related by $N(\epsilon, \Theta, \rho) \le D(\epsilon, \Theta, \rho) \le N(\epsilon/2, \Theta, \rho)$. $\mathrm{Diam}(\Theta)$ denotes the diameter of $\Theta$.
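The ordering among the three divergences, $d_V^2/2 \le d_h^2 \le d_V$ and $d_h^2 \le d_K/2$, can be sanity-checked on a toy example. The sketch below is illustrative only: it assumes a finite sample space with counting measure as the dominating measure, and the two probability vectors are hypothetical.

```python
import math


def total_variation(p, q):
    """d_V = (1/2) * sum |p - q| (counting measure as dominating measure)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_sq(p, q):
    """d_h^2 = (1/2) * sum (sqrt(p) - sqrt(q))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def kullback_leibler(p, q):
    """d_K = sum p * log(p / q); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical probability vectors on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

d_v = total_variation(p, q)
d_h2 = hellinger_sq(p, q)
d_k = kullback_leibler(p, q)

print(d_v ** 2 / 2 <= d_h2 <= d_v)  # True
print(d_h2 <= d_k / 2)              # True
```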



2 Wasserstein distances for mixing measures 
2.1 Definition and a basic inequality 

Let $(\Theta, \rho)$ be a space equipped with a non-negative distance function $\rho$ such that $\rho(\theta_1, \theta_2) = 0$ if and only if $\theta_1 = \theta_2$. If in addition $\rho$ is symmetric ($\rho(\theta_1, \theta_2) = \rho(\theta_2, \theta_1)$) and satisfies the triangle inequality, then it is a proper metric. A discrete probability measure $G$ on a measure space equipped with the Borel sigma algebra takes the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ for some $k \in \mathbb{N} \cup \{+\infty\}$, where $p = (p_1, p_2, \ldots, p_k)$ denotes the proportion vector, while $\theta = (\theta_1, \ldots, \theta_k)$ are the associated atoms in $\Theta$. $p$ has to satisfy $0 \le p_i \le 1$ and $\sum_{i=1}^{k} p_i = 1$. Likewise, $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$ is another discrete probability measure that has at most $k'$ distinct atoms. Let $\mathcal{G}_k(\Theta)$ denote the space of all discrete probability measures with at most $k$ atoms. Let $\mathcal{G}(\Theta) = \cup_{k \ge 1} \mathcal{G}_k(\Theta)$, the set of all discrete measures with finite support. Finally, $\bar{\mathcal{G}}(\Theta)$ denotes the space of all discrete measures (including those with countably infinite support).

Let $q = (q_{ij})_{i \le k;\, j \le k'} \in [0,1]^{k \times k'}$ denote a joint probability distribution on $\mathbb{N}_+ \times \mathbb{N}_+$ that satisfies the marginal constraints: $\sum_{i=1}^{k} q_{ij} = p'_j$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for any $i = 1, \ldots, k$; $j = 1, \ldots, k'$. Let $\mathcal{Q}(p, p')$ denote the space of all such joint distributions. We start with the $L_1$ Wasserstein distance:
Definition 1. Let $\rho$ be a distance function on $\Theta$. The Wasserstein distance functional for two discrete measures $G(p, \theta)$ and $G'(p', \theta')$ is:

$$d_\rho(G,G') = \inf_{q \in \mathcal{Q}(p,p')} \sum_{i,j} q_{ij}\,\rho(\theta_i, \theta'_j). \qquad (1)$$

We focus mainly on the $L_1$ Wasserstein distance $d_\rho$ and the $L_2$ Wasserstein distance. The latter corresponds to the square root of $d_{\rho^2}$ in our definition, where $\rho(\cdot)$ is replaced by $\rho^2(\cdot)$. Note that $d_\rho^2(G, G') \le d_{\rho^2}(G, G')$ by an application of the Cauchy-Schwarz inequality. We will consider a variety of choices of distance $\rho$ in the sequel.

From here on, a discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ plays the role of the mixing distribution in a mixture model. Let $f(x|\theta)$ denote the density (with respect to a dominating measure $\mu$) of a random variable $X$ taking values in $\mathcal{X}$, given parameter $\theta \in \Theta$. For ease of notation, we also use $f_i(x)$ for $f(x|\theta_i)$. Combining $G$ with the likelihood function $f$ yields a mixture distribution for $X$ with the following density:

$$p_G(x) = \int f(x|\theta)\,dG(\theta) = \sum_i p_i f_i(x).$$



First we state a general result that relates Wasserstein distances between mixing measures $G, G'$, namely $d_\rho(G, G')$, and divergences between mixture densities $p_G, p_{G'}$. Divergence functionals that play an important role in this paper are the total variational distance, the Hellinger distance, and the Kullback-Leibler distance. All these are in fact instances of a broader class of divergence functionals known as the f-divergences (Csiszar, 1966; Ali & Silvey, 1967):

Definition 2. Let $\phi : \mathbb{R} \to \mathbb{R}$ denote a convex function. An f-divergence (or Ali-Silvey distance) between two probability densities $f_i$ and $f'_j$ is defined as $d_\phi(f_i, f'_j) = \int \phi(f'_j/f_i)\, f_i\, d\mu$. Likewise, the f-divergence between $p_G$ and $p_{G'}$ is $d_\phi(p_G, p_{G'}) = \int \phi(p_{G'}/p_G)\, p_G\, d\mu$.

For $\phi(u) = \frac{1}{2}(\sqrt{u} - 1)^2$ we obtain the squared Hellinger distance. For $\phi(u) = \frac{1}{2}|u - 1|$ we obtain the variational distance. For $\phi(u) = -\log u$, we obtain the Kullback-Leibler divergence. f-divergence functionals can be used as a distance function or metric on $\Theta$, motivating the following definition.

Definition 3. When $\rho$ is taken to be an f-divergence, $\rho(\theta_i, \theta'_j) = d_\phi(f_i, f'_j)$, for a convex function $\phi$, the corresponding Wasserstein distance functional is called a composite Wasserstein distance:

$$d_{\rho_\phi}(G,G') = \inf_{q \in \mathcal{Q}(p,p')} \sum_{i,j} q_{ij}\, d_\phi(f_i, f'_j).$$

In particular, $d_V, d_h, d_K$ induce the composite Wasserstein distances $d_{\rho_V}, d_{\rho_h}, d_{\rho_K}$, respectively. Let $d_{\rho_{h^2}}(G, G')$ denote the composite Wasserstein distance function obtained by taking $\rho(\theta_i, \theta'_j) := d_h^2(f_i, f'_j)$. The following result shows that any f-divergence between mixture distributions $p_G, p_{G'}$ is dominated by a Wasserstein distance for a suitable choice of distance $\rho$.

Lemma 1. Let $G, G' \in \bar{\mathcal{G}}(\Theta)$ be such that both $d_\phi(p_G, p_{G'})$ and $d_{\rho_\phi}(G, G')$ are finite for some convex function $\phi$. Then, $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.

As will be evident in the sequel, this lemma is also handy in enabling us to obtain lower bounds on small ball probabilities in the space of mixture densities $p_G$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$. The latter quantities are typically easier to estimate than the former.
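One standard route to the bound in Lemma 1 is Jensen's inequality applied pointwise; the sketch below is offered as an illustration and is not necessarily the argument used in the paper. It assumes, for simplicity, that $f_i > 0$ wherever $p_G > 0$, and uses the fact that for any coupling $q \in \mathcal{Q}(p,p')$ the weights $q_{ij} f_i / p_G$ sum to one.

```latex
% Sketch of Lemma 1 via Jensen's inequality (assumption: f_i > 0 where p_G > 0).
\begin{align*}
p_G &= \sum_{i,j} q_{ij} f_i, \qquad
p_{G'} = \sum_{i,j} q_{ij} f'_j
  \qquad \text{for any } q \in \mathcal{Q}(p,p'), \\
\phi\!\left(\frac{p_{G'}}{p_G}\right) p_G
  &= \phi\!\left(\sum_{i,j} \frac{q_{ij} f_i}{p_G} \cdot \frac{f'_j}{f_i}\right) p_G
  \;\le\; \sum_{i,j} q_{ij}\, f_i\, \phi\!\left(\frac{f'_j}{f_i}\right), \\
d_\phi(p_G, p_{G'})
  &\le \sum_{i,j} q_{ij}\, d_\phi(f_i, f'_j)
  \qquad \text{(integrating with respect to } \mu\text{)}.
\end{align*}
% Taking the infimum over q yields d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G').
```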

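Example 1. Suppose that $\Theta = \mathbb{R}^d$, $\rho$ is the Euclidean metric, and $f(x|\theta)$ is the multivariate normal density $N(\theta, I_{d \times d})$ with mean $\theta$ and identity covariance matrix. Then $d_h^2(f_i, f'_j) = 1 - \exp\big(-\frac{1}{8}\|\theta_i - \theta'_j\|^2\big) \le \frac{1}{8}\|\theta_i - \theta'_j\|^2$. So, $d_{\rho_{h^2}}(G, G') \le d_{\rho^2}(G, G')/8$. The above lemma then entails that $d_h^2(p_G, p_{G'}) \le d_{\rho^2}(G, G')/8$.

Similarly, for the Kullback-Leibler divergence, since $d_K(f_i, f'_j) = \frac{1}{2}\|\theta_i - \theta'_j\|^2$, by Lemma 1, $d_K(p_G, p_{G'}) \le d_{\rho_K}(G, G') = \frac{1}{2} d_{\rho^2}(G, G')$. Next, suppose that $\Theta$ is a compact subset of $\mathbb{R}^d$ and consider $\phi(u) = (\log u)^2$, which is a convex function on $[0, \infty)$. We have $\int f_i [\log(f_i/f'_j)]^2 = O(\|\theta_i - \theta'_j\|^2)$, so $\int p_G [\log(p_G/p_{G'})]^2 \le O(d_{\rho^2}(G, G'))$.

For another example, if $f(x|\theta)$ is a Gamma density with location parameter $\theta$, and $\Theta$ is a compact subset of $\mathbb{R}$, then $d_K(f_i, f'_j) = O(|\theta_i - \theta'_j|)$. This entails that $d_K(p_G, p_{G'}) \le O(d_\rho(G, G'))$.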
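The two Gaussian bounds of Example 1, $d_h^2(p_G, p_{G'}) \le d_{\rho^2}(G,G')/8$ and $d_K(p_G, p_{G'}) \le d_{\rho^2}(G,G')/2$, can be checked numerically in one dimension. The sketch below is illustrative: the two mixing measures, the integration grid, and all function names are assumptions, and the transport cost is computed with the monotone coupling, which is optimal on the real line for the squared Euclidean cost.

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x, atoms, probs):
    """p_G(x) = sum_i p_i f(x | theta_i) with a unit-variance Gaussian kernel."""
    return sum(p * normal_pdf(x, t) for t, p in zip(atoms, probs))


def transport_cost_sq_1d(atoms_a, probs_a, atoms_b, probs_b):
    """d_{rho^2}(G, G') on the real line via the monotone coupling."""
    a = sorted(zip(atoms_a, probs_a))
    b = sorted(zip(atoms_b, probs_b))
    i = j = 0
    mass_a, mass_b = a[0][1], b[0][1]
    cost = 0.0
    while True:
        moved = min(mass_a, mass_b)
        cost += moved * (a[i][0] - b[j][0]) ** 2
        mass_a -= moved
        mass_b -= moved
        if mass_a <= 1e-12:
            i += 1
            if i == len(a):
                break
            mass_a = a[i][1]
        if mass_b <= 1e-12:
            j += 1
            if j == len(b):
                break
            mass_b = b[j][1]
    return cost


def integrate(func, lo=-12.0, hi=12.0, steps=4800):
    """Midpoint rule; Gaussian tails outside [lo, hi] are negligible here."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))


# Hypothetical mixing measures (atoms, probabilities).
G = ([-1.0, 1.0], [0.5, 0.5])
H = ([-0.5, 1.5], [0.3, 0.7])

hell_sq = integrate(
    lambda x: 0.5 * (math.sqrt(mixture_pdf(x, *G)) - math.sqrt(mixture_pdf(x, *H))) ** 2
)
kl = integrate(
    lambda x: mixture_pdf(x, *G) * math.log(mixture_pdf(x, *G) / mixture_pdf(x, *H))
)
cost = transport_cost_sq_1d(*G, *H)

print(hell_sq <= cost / 8)  # True
print(kl <= cost / 2)       # True
```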



2.2 Wasserstein metric identifiability in finite mixture models 

Lemma 1 shows that for many choices of $\rho$, $d_\rho$ yields a stronger topology on $\bar{\mathcal{G}}(\Theta)$ than the topology induced by f-divergences on the space of mixture distributions $p_G$. In other words, convergence of $p_G$ may not imply convergence of $G$ in the $d_\rho$ metric. To ensure this property, additional conditions are needed on the space of discrete measures $\bar{\mathcal{G}}(\Theta)$, along with identifiability conditions for the family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$.

The classical definition of Teicher (Teicher, 1961) specifies the family $\{f(\cdot|\theta), \theta \in \Theta\}$ to be identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $\|p_G - p_{G'}\|_\infty = 0$ implies that $G = G'$. We need a slightly stronger version, allowing for the inclusion of discrete measures with infinite support:

Definition 4. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is finitely identifiable if for any $G \in \mathcal{G}(\Theta)$ and $G' \in \bar{\mathcal{G}}(\Theta)$, $|p_G(x) - p_{G'}(x)| = 0$ for almost all $x \in \mathcal{X}$ implies that $G = G'$.

To obtain convergence rates, we also need the notion of strong identifiability of Chen (1995), herein adapted to a multivariate setting.



Definition 5. Assume that $\Theta \subseteq \mathbb{R}^d$ and $\rho$ is the Euclidean metric. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice differentiable in $\theta$ and for any finite $k$ and $k$ different $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Big| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + \beta_i^T Df(x|\theta_i) + \gamma_i^T D^2 f(x|\theta_i)\, \gamma_i \Big| = 0 \qquad (2)$$

implies that $\alpha_i = 0$, $\beta_i = \gamma_i = \mathbf{0} \in \mathbb{R}^d$ for $i = 1, \ldots, k$. Here, for each $x$, $Df(x|\theta_i)$ and $D^2 f(x|\theta_i)$ denote the gradient and the Hessian at $\theta_i$ of the function $f(x|\cdot)$, respectively.

Finite identifiability is satisfied for the family of Gaussian distributions (Teicher, 1960); see also Theorem 1 of Ishwaran and Zarepour (2002). Chen identified a broad class of families, including the Gaussian family, for which the strong identifiability condition holds (Chen, 1995).

Define $\psi(G, G') = \sup_x |p_G(x) - p_{G'}(x)| / d_{\rho^2}(G, G')$ if $G \ne G'$ and $\infty$ otherwise. Also define $\psi_1(G, G') = d_V(p_G, p_{G'}) / d_{\rho^2}(G, G')$ if $G \ne G'$ and $\infty$ otherwise. The notion of strong identifiability is useful via the following key result, which generalizes Chen's result to $\Theta$ of arbitrary dimensions.


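Theorem 1. (Strong identifiability). Suppose that $\Theta$ is a compact subset of $\mathbb{R}^d$, the family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable, and for all $x \in \mathcal{X}$, the Hessian matrix $D^2 f(x|\theta)$ satisfies a uniform Lipschitz condition

$$\big|\gamma^T \big(D^2 f(x|\theta_1) - D^2 f(x|\theta_2)\big) \gamma\big| \le C \rho(\theta_1, \theta_2)^{\delta} \|\gamma\|^2 \qquad (3)$$

for all $x, \theta_1, \theta_2$ and some fixed $C$ and $\delta > 0$. Then, for fixed $G_0 \in \mathcal{G}_k(\Theta)$, where $k < \infty$:

$$\lim_{\epsilon \to 0}\ \inf_{G, G' \in \mathcal{G}_k(\Theta)} \Big\{ \psi(G, G') : d_\rho(G_0, G) \vee d_\rho(G_0, G') \le \epsilon \Big\} > 0. \qquad (4)$$

The assertion also holds with $\psi$ being replaced by $\psi_1$.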



Remarks. (i) In Section 5, the notion of strong identifiability is extended to an infinite dimensional setting via first and second order Fréchet derivatives in normed spaces.
(ii) Suppose that $G_0$ has exactly $k$ support points in $\Theta$. Then, an examination of the proof reveals that the requirement that $\Theta$ be compact is not needed. Indeed, if there is a sequence of $G_n \in \mathcal{G}_k(\Theta)$ such that $d_\rho(G_0, G_n) \to 0$, then it is simple to show that there is a subsequence of $G_n$ that also has $k$ distinct atoms, which converge in the $\rho$ metric to the set of $k$ atoms of $G_0$ (up to some permutation of the labels). The proof of the theorem proceeds as before.

For the rest of this paper, by strong identifiability we always mean the conditions specified in Theorem 1, so that Eq. (4) can be deduced. This practically means that the conditions specified by Eq. (2) and Eq. (3) are given, while the compactness of $\Theta$ may sometimes be required.



2.3 Wasserstein metric identifiability in infinite mixture models 

Next, we state a counterpart of Theorem 1 for $G, G' \in \bar{\mathcal{G}}(\Theta)$, i.e., mixing measures with infinitely many support points. We restrict our attention to convolution mixture models on $\mathbb{R}^d$. That is, the likelihood density function $f(x|\theta)$, with respect to Lebesgue measure, takes the form $f(x - \theta)$ for some multivariate density function $f$ on $\mathbb{R}^d$. Thus, $p_G(x) = G * f(x) = \sum_{i=1}^{k} p_i f(x - \theta_i)$ and $p_{G'}(x) = G' * f(x) = \sum_{j=1}^{k'} p'_j f(x - \theta'_j)$.

As before, we need a compactness condition for $\Theta$. Additional key assumptions concern the smoothness of the density function $f$. This is characterized in terms of the tail behavior of the Fourier transform $\tilde{f}$ of $f$. We consider both ordinary smooth densities (e.g., Laplace and Gamma), and supersmooth densities (e.g., normal). The following result does not require that $G, G'$ be discrete.

Theorem 2. Suppose that $G, G'$ are probability measures that place full support on a compact set $\Theta \subset \mathbb{R}^d$. $f$ is a density function on $\mathbb{R}^d$ that is symmetric (around 0), i.e., $\int_A f\,dx = \int_{-A} f\,dx$ for any Borel set $A \subset \mathbb{R}^d$. Moreover, assume that $\tilde{f}(\omega) \ne 0$ for any $\omega \in \mathbb{R}^d$.

(1) Ordinary smooth likelihood. If $\big|\tilde{f}(\omega) \prod_{j=1}^{d} |\omega_j|^{\beta}\big| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constant $d_0$ and constant $\beta$, then for any $m < 4/(4 + (2\beta + 1)d)$, there is some constant $C(d, \beta, m)$ dependent only on $d$, $\beta$ and $m$ such that

$$d_{\rho^2}(G, G') \le C(d, \beta, m)\, d_V(p_G, p_{G'})^{m},$$

as $d_V(p_G, p_{G'}) \to 0$.

(2) Supersmooth likelihood. If $\big|\tilde{f}(\omega) \prod_{j=1}^{d} \exp(|\omega_j|^{\beta}/\gamma)\big| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constants $\beta, \gamma, d_0$, then there is some constant $C(d, \beta)$ dependent only on $d$ and $\beta$ such that

$$d_{\rho^2}(G, G') \le C(d, \beta)\big(-\log d_V(p_G, p_{G'})\big)^{-2/\beta},$$

as $d_V(p_G, p_{G'}) \to 0$.

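Example 2. For the standard normal density on $\mathbb{R}^d$, $\tilde{f}(\omega) = \prod_{j=1}^{d} \exp(-\omega_j^2/2)$, we obtain that $d_{\rho^2}(G, G') \lesssim \big(-\log d_V(p_G, p_{G'})\big)^{-1}$ as $d_\rho(G, G') \to 0$ (so that $d_V(p_G, p_{G'}) \to 0$, by Lemma 1). For a Laplace density on $\mathbb{R}^d$, e.g., $\tilde{f}(\omega) = \prod_{j=1}^{d} \frac{1}{1 + \omega_j^2}$, we have $d_{\rho^2}(G, G') \lesssim d_V(p_G, p_{G'})^{m}$ for any $m < 4/(4 + 5d)$, as $d_\rho(G, G') \to 0$.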



3 Convergence of posterior distributions of mixing measures 

We are ready to study the convergence of discrete mixing measures in a Bayesian setting. Let $X_1, \ldots, X_n$ be an iid sample according to the mixture density $p_G(x) = \int f(x|\theta)\,dG(\theta)$, where $f$ is known, while $G = G_0$ for some unknown mixing measure in $\mathcal{G}_k(\Theta)$. The number of support points $k$ may not be known. In the Bayesian framework, $G$ is endowed with a prior distribution $\Pi$ on a suitable measure space of discrete probability measures in $\bar{\mathcal{G}}(\Theta)$. The posterior distribution of $G$ is given by, for any measurable set $B$:

$$\Pi(B|X_1, \ldots, X_n) = \int_B \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G) \Big/ \int \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G).$$
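To make the posterior formula concrete, the following sketch evaluates it when the prior is supported on finitely many candidate mixing measures, so that both integrals reduce to sums. This finite-support prior is a simplifying assumption made only for illustration (the paper works with priors on spaces of discrete measures); the Gaussian kernel, the candidate list, the sample size and all names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x, atoms, probs):
    """p_G(x) = sum_i p_i f(x | theta_i) with a unit-variance Gaussian kernel."""
    return sum(p * normal_pdf(x, t) for t, p in zip(atoms, probs))


def posterior_weights(data, candidates, prior):
    """Posterior mass of each candidate G, computed in log space for stability."""
    log_post = []
    for (atoms, probs), prior_mass in zip(candidates, prior):
        log_lik = sum(math.log(mixture_pdf(x, atoms, probs)) for x in data)
        log_post.append(math.log(prior_mass) + log_lik)
    shift = max(log_post)
    unnorm = [math.exp(v - shift) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Simulate an iid sample from a hypothetical "true" mixing measure G_0.
rng = random.Random(0)
true_atoms, true_probs = [-2.0, 2.0], [0.5, 0.5]
data = [
    rng.gauss(rng.choices(true_atoms, weights=true_probs)[0], 1.0)
    for _ in range(300)
]

# Candidate mixing measures: G_0 itself and two misspecified alternatives.
candidates = [
    (true_atoms, true_probs),
    ([0.0], [1.0]),
    ([-1.0, 1.0], [0.5, 0.5]),
]
prior = [1.0 / 3, 1.0 / 3, 1.0 / 3]

weights = posterior_weights(data, candidates, prior)
print(weights)
```

With this sample size the posterior mass is expected to pile up on the first candidate, which is the behavior that the consistency results below quantify in the $d_\rho$ metric.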
We shall study conditions under which the posterior distribution is consistent, i.e., it concentrates on arbitrarily small $d_\rho$ neighborhoods of $G_0$, and establish the rates of convergence. The analysis is based upon the general framework of Ghosal, Ghosh and van der Vaart (Ghosal et al., 2000), who analyzed the convergence behavior of posterior distributions in terms of f-divergences such as Hellinger and variational distances on the mixture densities of the data. In the following we formulate two convergence theorems for the mixture model setting (which can be viewed as counterparts of Theorems 2.1 and 2.4 of Ghosal et al. (2000)). A notable feature of our theorems is that conditions (e.g., entropy and prior concentration) are stated in terms of the Wasserstein metric, as opposed to f-divergences on the mixture densities. They may typically be separated into independent conditions for the prior for $G$ and the likelihood family, and are simpler to verify for mixture models. In addition, the convergence of the posterior distribution of mixing measures is established in terms of Wasserstein distance metrics.

The following notion plays a central role in the formulation of the theorems.

Definition 6. Let $\mathcal{G}$ be a subset of $\bar{\mathcal{G}}(\Theta)$. For each $k < \infty$, define the Hellinger information of the $d_\rho$ metric as a real-valued function on the real line $C_k(\mathcal{G}, \cdot) : \mathbb{R} \to \mathbb{R}$:

$$C_k(\mathcal{G}, r) = \inf_{G_0 \in \mathcal{G}_k(\Theta),\ G \in \mathcal{G}:\ d_\rho(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)/2. \qquad (5)$$

It is obvious that $C_k$ is a non-negative and non-decreasing function. The following characterization of $C_k$ follows immediately from the results obtained in the previous section.

Proposition 1. (a) If $\mathcal{G}$ and $\mathcal{G}_k(\Theta)$ are both compact in the Wasserstein topology, and the family of likelihood functions is finitely identifiable, then $C_k(\mathcal{G}, r) > 0$ for any $r > 0$.

(b) If $\Theta \subset \mathbb{R}^d$ is compact, and the family of likelihood functions is strongly identifiable as specified in Theorem 1, then for each $k$ there is a constant $c(k) > 0$ such that $C_k(\mathcal{G}_k(\Theta), r) \ge c(k) r^4$ for all $r > 0$.

(c) If $\Theta \subset \mathbb{R}^d$ is compact, and the family of likelihood functions is ordinary smooth with parameter $\beta$, as specified in Theorem 2, then for any $\epsilon > 0$ there is some constant $c(d, \beta)$ such that $C_k(\bar{\mathcal{G}}(\Theta), r) \ge c(d, \beta)\, r^{4 + (2\beta + 1)d + \epsilon}$ for any $r > 0$ and any $k > 0$. For a supersmooth likelihood family, we have $C_k(\bar{\mathcal{G}}(\Theta), r) \ge \exp[-c(d, \beta)\, r^{-\beta}]$ for any $r > 0$ and any $k > 0$.

The following two theorems have three types of conditions. The first is concerned with the size of the support of $\Pi$, often quantified in terms of its entropy number. Estimates of the entropy number defined in terms of Wasserstein metrics for several measure classes of interest are given in Lemma 2. The second is on the Kullback-Leibler support of $\Pi$, which is related to both the space of discrete measures $\bar{\mathcal{G}}(\Theta)$ and the family of likelihood functions $f(x|\theta)$. The Kullback-Leibler neighborhood is defined as:
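$$B(\epsilon) = \Big\{ G \in \bar{\mathcal{G}}(\Theta) : d_K(p_{G_0}, p_G) \le \epsilon^2,\ \int p_{G_0} [\log(p_{G_0}/p_G)]^2\, d\mu \le \epsilon^2 \Big\}. \qquad (6)$$

The third condition is on the Hellinger information of the $d_\rho$ metric, the function $C_k(\mathcal{G}, r)$, a characterization of which is given above.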

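Theorem 3. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $(\epsilon_n)_{n \ge 1}$ that tends to a constant (or 0) such that $n\epsilon_n^2 \to \infty$, sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$ and a constant $C > 0$, we have

$$\log D(\epsilon_n, \mathcal{G}_n, d_\rho) \le n\epsilon_n^2, \qquad (7)$$
$$\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) \le \exp[-n\epsilon_n^2 (C + 4)], \qquad (8)$$
$$\Pi(B(\epsilon_n)) \ge \exp(-n\epsilon_n^2 C). \qquad (9)$$

Moreover, suppose $M_n$ is a sequence such that

$$C_k(\mathcal{G}_n, M_n \epsilon_n) \ge \epsilon_n^2 (C + 4), \qquad (10)$$
$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (11)$$

Then, $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

A stronger theorem (using a substantially weaker condition on the covering number) can be formulated as follows.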

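Theorem 4. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $(\epsilon_n)_{n \ge 1}$ that tends to a constant (or 0) such that $n\epsilon_n^2$ is bounded away from 0 or tending to infinity, and sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, we have

$$\log D(\epsilon/2, \{G \in \mathcal{G}_n : \epsilon \le d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le n\epsilon_n^2 \quad \text{for every } \epsilon \ge \epsilon_n, \qquad (12)$$
$$\frac{\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n)}{\Pi(B(\epsilon_n))} = o(\exp(-2n\epsilon_n^2)), \qquad (13)$$
$$\frac{\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)}{\Pi(B(\epsilon_n))} \le \exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \quad \text{for any } j \ge M_n, \qquad (14)$$

where $M_n$ is a sequence such that

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \to 0. \qquad (15)$$

Then, we have that $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

Remarks. (i) The above statement continues to hold if conditions (14) and (15) are replaced by the following condition:

$$\frac{\exp(2n\epsilon_n^2)}{\Pi(B(\epsilon_n))} \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (16)$$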
(ii) The above theorems are stated for the $L_1$ Wasserstein metric $d_\rho$, but they also hold for the $L_2$ Wasserstein metric $d_{\rho^2}^{1/2}$, with a slight modification of the definition of the Hellinger information function to $C_k(\mathcal{G}, r) = \inf_{d_{\rho^2}^{1/2}(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)$.

Before moving to specific examples, we state a simple lemma, which provides estimates of the entropy under the $d_\rho$ metric for a number of classes of discrete measures of interest. Because $d_\rho$ inherits directly the $\rho$ metric in $\Theta$, the entropy for classes in $(\bar{\mathcal{G}}(\Theta), d_\rho)$ can typically be bounded in terms of the covering number for subsets of $(\Theta, \rho)$. These bounds will be used extensively in the sequel.

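Lemma 2. (a) $\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \le k\big(\log N(\epsilon, \Theta, \rho) + \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)\big)$.

(b) $\log N(2\epsilon, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon, \Theta, \rho) \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)$.

(c) Let $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. Assume that $M = \max_{i \le k} 1/p_i^* < \infty$ and $m = \min_{i \ne j \le k} \rho(\theta_i^*, \theta_j^*) > 0$. Then,

$$\log N(\epsilon/2, \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le k\Big(\sup_{\Theta'} \log N(\epsilon/4, \Theta', \rho) + \log(32k\,\mathrm{Diam}(\Theta)/m)\Big),$$

where the supremum on the right side is taken over all bounded subsets $\Theta' \subseteq \Theta$ such that $\mathrm{Diam}(\Theta') \le 4M\epsilon$.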

4 Examples 

In this section the general theory is illustrated in specific mixture models, including finite 
mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and 
finite mixtures of Gaussian processes. 

4.1 Finite mixture of multivariate distributions 

Let $\Theta$ be a subset of $\mathbb{R}^d$, $\rho$ be the Euclidean metric, and $\Pi$ be a prior distribution for discrete measures in $\mathcal{G}_k(\Theta)$, where $k < \infty$ is known. Suppose that the "truth" is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. To obtain the convergence rate of the posterior distribution of $G$, we need:

Assumptions A.

(A1) $\Theta$ is compact and the family of likelihood functions $f(\cdot|\theta)$ is strongly identifiable.

(A2) For some positive constants $C_1, C_2$, $d_K(f_i, f'_j) \le C_1 \rho^2(\theta_i, \theta'_j)$ and $\int f_i [\log(f_i/f'_j)]^2 \le C_2 \rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$.

(A3) Under prior $\Pi$, for small $\delta > 0$, $c_3 \delta^{k} \le \Pi(|p_i - p_i^*| \le \delta, i = 1, \ldots, k) \le C_3 \delta^{k}$ and $c_3 \delta^{kd} \le \Pi(\rho(\theta_i, \theta_i^*) \le \delta, i = 1, \ldots, k) \le C_3 \delta^{kd}$ for some constants $c_3, C_3 > 0$.

Remarks. (A1) and (A2) hold for the family of Gaussian densities with mean parameter $\theta$. (A3) holds when the prior distribution on the relevant parameters behaves like a uniform distribution, up to a multiplicative constant.

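Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$. Combining Lemma 1 with Assumption (A2), if $\rho(\theta_i, \theta_i^*) \le \epsilon$ and $|p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2)$ for $i = 1, \ldots, k$, then $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 \sum_{1 \le i,j \le k} q_{ij}\, \rho^2(\theta_i^*, \theta_j)$ for any $q \in \mathcal{Q}$. Thus, $d_K(p_{G_0}, p_G) \le C_1 d_{\rho^2}(G_0, G) \le C_1 \sum_{i=1}^{k} (p_i^* \wedge p_i)\, \rho^2(\theta_i^*, \theta_i) + C_1 \sum_{i=1}^{k} |p_i - p_i^*|\, \mathrm{Diam}(\Theta)^2 \le 2C_1 \epsilon^2$. Hence, under prior $\Pi$,

$$\Pi(G : d_K(p_{G_0}, p_G) \le \epsilon^2) \gtrsim \Pi(G : \rho(\theta_i, \theta_i^*) \le \epsilon,\ |p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2),\ i = 1, \ldots, k).$$

Similarly, due to (A2), $\int p_{G_0} [\log(p_{G_0}/p_G)]^2 \le C_2 d_{\rho^2}(G_0, G)$. Thus, in view of Assumption (A3), we have $\Pi(B(\epsilon)) \gtrsim \epsilon^{k(d+2)}$. Conversely, for sufficiently small $\epsilon$, if $d_{\rho^2}(G_0, G) \le \epsilon^2$ then by reordering the index of the atoms, we must have $\rho(\theta_i, \theta_i^*) = O(\epsilon)$ and $|p_i - p_i^*| = O(\epsilon^2)$ for all $i = 1, \ldots, k$ (see the argument in the proof of Lemma 2(c)). This entails that under the prior $\Pi$,

$$\Pi(G : d_{\rho^2}(G_0, G) \le \epsilon^2) \le \Pi(G : \rho(\theta_i, \theta_i^*) \le O(\epsilon),\ |p_i - p_i^*| \le O(\epsilon^2),\ i = 1, \ldots, k) \lesssim \epsilon^{k(d+2)}.$$

Let $\epsilon_n = n^{-1/2}$. We proceed by verifying the conditions of Theorem 4, as this theorem provides the right rate for parametric mixture models under the $L_2$ Wasserstein distance metric $d_{\rho^2}^{1/2}$. Let $\mathcal{G}_n := \mathcal{G}_k(\Theta)$. Then $\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) = 0$, so Eq. (13) trivially holds.

Next, we show that $D(\epsilon/2, S, d_{\rho^2}^{1/2})$, where $S = \{G \in \mathcal{G}_n : d_\rho(G_0, G) \le 2\epsilon\}$, is bounded above by a constant, so that (12) is satisfied. Indeed, for any $\epsilon > 0$, $\log D(\epsilon/2, S, d_{\rho^2}^{1/2}) \le \log N(\epsilon/4, S, d_{\rho^2}^{1/2}) \le N(\epsilon/4, S, d_\rho)$. By Lemma 2(c), $N(\epsilon/4, S, d_\rho)$ is bounded in terms of $\sup_{\Theta'} \log N(\epsilon/8, \Theta', \rho)$, which is bounded above by a constant when the $\Theta'$ are subsets of $\Theta$ whose diameter is bounded by a multiple of $\epsilon$. Thus, Eq. (12) holds.

By Proposition 1(b) and Assumption (A1), there is $C_k(\mathcal{G}_n, j\epsilon_n) = \inf_{d_{\rho^2}(G_0, G) \ge (j\epsilon_n/2)^2} d_h^2(p_{G_0}, p_G) \ge c(j\epsilon_n)^4$ for some constant $c > 0$. To ensure condition (15), note that:

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \le \exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nc(j\epsilon_n)^4] \lesssim \exp(2n\epsilon_n^2 - ncM_n^4\epsilon_n^4).$$

This upper bound goes to zero if $ncM_n^4\epsilon_n^4 \ge 4n\epsilon_n^2$, which is satisfied by taking $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$. Thus we need $M_n \epsilon_n \asymp \epsilon_n^{1/2} = n^{-1/4}$.

Under the assumptions specified above, $\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)/\Pi(B(\epsilon_n)) = O(1)$. On the other hand, for $j \ge M_n$, we have $\exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \ge \exp[nc(j\epsilon_n)^4/2]$, which is bounded below by an arbitrarily large constant by choosing $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$, thereby ensuring (14).

Thus, by Theorem 4, the rate of contraction for the posterior distribution of $G$ under the $d_{\rho^2}^{1/2}$ distance metric is $n^{-1/4}$, which is also the minimax rate as proved in the univariate case by Chen (1995). Our calculation is summarized by: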
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Theorem 5. Under Assumptions (A1-A3), the contraction rate in the L2 Wasserstein dis- 
tance metric of the posterior distribution ofG is n~^/^. 

4.2 Infinite mixture based on the Dirichlet process 

Suppose that the "true" discrete measure is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$, where $\Theta$ is a metric space but $k < \infty$ is unknown. To estimate $G_0$, the prior distribution $\Pi$ on discrete measures $G \in \mathcal{G}(\Theta)$ is taken to be a Dirichlet process $\mathrm{DP}(\nu, P_0)$ that centers at $P_0$ with concentration parameter $\nu > 0$ (Ferguson, 1973). Here, parameter $P_0$ is a probability measure on $\Theta$.

For any $r \ge 1$, the following lemma provides a lower bound of small ball probabilities of metric space $(\mathcal{G}(\Theta), d_{\rho^r})$ in terms of small ball probabilities of metric space $(\Theta, \rho)$.



Lemma 3. Let $G \sim \mathrm{DP}(\nu, P_0)$, where $P_0$ is a non-atomic base probability measure on a compact set $\Theta$. For a small $\epsilon > 0$, let $D = D(\epsilon, \Theta, \rho)$ denote the packing number of $\Theta$ under the $\rho$ metric. Then, under the Dirichlet process distribution,

$$\Pi\big(G : d_{\rho^r}(G_0, G) \le (2^r + 1)\epsilon^r\big) \ge \Gamma(\nu)\,\big(\epsilon^r D^{-1}/\mathrm{Diam}(\Theta)^r\big)^{D-1}\,\nu^{D} \prod_{i=1}^{D} P_0(S_i).$$

Here, $(S_1, \ldots, S_D)$ denotes the $D$ disjoint $\epsilon/2$-balls that form a maximal packing of $\Theta$. $\Gamma(\cdot)$ is the gamma function.
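The proof of Lemma 3 rests on the fact that the masses a Dirichlet process assigns to the cells of a finite partition follow a Dirichlet distribution, so the small-ball probability reduces to a statement about $\|m - p\|_1$. The following Python sketch estimates such a probability by Monte Carlo. It is an illustration only: the function names, the number of cells, the base masses and the target vector `p` are all hypothetical choices made here, not values from the paper.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]


def small_ball_probability(nu, mu, p, radius, n_draws=20000, seed=0):
    """Monte Carlo estimate of Pr(||m - p||_1 <= radius) for m ~ Dir(nu * mu)."""
    rng = random.Random(seed)
    alphas = [nu * w for w in mu]
    hits = 0
    for _ in range(n_draws):
        m = sample_dirichlet(alphas, rng)
        if sum(abs(a - b) for a, b in zip(m, p)) <= radius:
            hits += 1
    return hits / n_draws


# Hypothetical setting: D = 4 cells with equal base mass and nu = 1.
mu = [0.25] * 4
p = [0.4, 0.3, 0.2, 0.1]
estimate = small_ball_probability(1.0, mu, p, radius=0.5)
```

Shrinking `radius` or increasing the number of cells makes the estimated probability fall quickly, which is the behavior the lower bound in the lemma quantifies.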

Proof. Since every point in $\Theta$ is of distance at most $\epsilon$ to one of the centers of $S_1, \ldots, S_D$, there is a $D$-partition $(S'_1, \ldots, S'_D)$ of $\Theta$, such that $S_i \subseteq S'_i$, and $\mathrm{Diam}(S'_i) \le 2\epsilon$ for each $i = 1, \ldots, D$. Let $m_i = G(S'_i)$, $\mu_i = P_0(S'_i)$, and $p_i = G_0(S'_i)$. From the definition of Dirichlet processes, $m = (m_1, \ldots, m_D) \sim \mathrm{Dir}(\nu\mu_1, \ldots, \nu\mu_D)$. Note that

$$d_{\rho^r}(G_0, G) \le (2\epsilon)^r + \|m - p\|_1 [\mathrm{Diam}(\Theta)]^r.$$

Due to the non-atomicity of $P_0$, for $\epsilon$ sufficiently small, $\nu\mu_i \le 1$ for all $i = 1, \ldots, D$. Let $\delta = \epsilon/\mathrm{Diam}(\Theta)$. Then, under $\Pi$,



Pr(d^(Go, G) < (2™+l)e'-) > Pr(||m-p||i < 5™) > Pr(|m,-p,| < i = 1, . . . , 

Ui=l r(i^/^i) f=i imax(,3,-5™/D,0) t=\ 

The second inequality is due to $(1 - \sum_{i=1}^{D-1} m_i)^{\nu\mu_D - 1} = m_D^{\nu\mu_D - 1} \ge 1$, since $\nu\mu_D \le 1$ and $0 < m_D < 1$ almost surely. The third inequality is due to the fact that $\Gamma(a) \le 1/a$ for $0 < a \le 1$. This gives the desired claim. □

Assumptions C.

(C1) The non-atomic base measure $P_0$ places full support on a compact set $\Theta$. The family of the likelihood densities $f(\cdot|\theta)$ is finitely identifiable.

(C2) For some constants $C_1, m_1, C_2, m_2 > 0$, $d_K(f_i, f'_j) \le C_1 \rho^{m_1}(\theta_i, \theta'_j)$ and $\int f_i[\log(f_i/f'_j)]^2 \le C_2 \rho^{m_2}(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$.

(C3) $P_0$ places sufficient probability mass on all small balls that pack $\Theta$. Specifically, there is a universal constant $C_3 > 0$ such that the probabilities of the $D$-partition $(S_1, \ldots, S_D)$ specified in Lemma 3 satisfy, for any $\epsilon > 0$:

$$\log \prod_{i=1}^{D} P_0(S_i) \ge C_3 D \log(1/D).$$

(C4) $\Theta \subset \mathbb{R}^d$ is compact, so that the packing number $D(\epsilon, \Theta, \rho) \asymp [\mathrm{Diam}(\Theta)/\epsilon]^d$.

Theorem 6. Given Assumptions (C1-C4), and the smoothness conditions for the likelihood family as specified in Theorem 2, there is a sequence $\beta_n \searrow 0$ such that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability. Specifically,

(1) For ordinary smooth likelihood functions, take $\beta_n \asymp (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d)+\delta}}$, for any small $\delta > 0$.

(2) For supersmooth likelihood functions, take $\beta_n \asymp (\log n)^{-1/\beta}$.

Proof. The proof consists of two main steps. First, we shall prove that under Assumptions (C1-C4), conditions specified by Eqs. (7), (8) and (9) in Theorem 3 are satisfied by taking $\mathcal{G}_n = \mathcal{G}(\Theta)$, and $\epsilon_n$ to be a large multiple of $(\log n/n)^{1/(d+2)}$. The second step involves constructing a sequence of $M_n$ and subsequently $\beta_n = M_n\epsilon_n$ for which Theorem 3 can be applied.

Step 1: By Lemma 1 and (C2), $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 d_{\rho^{m_1}}(G_0, G)$. Also, $\int p_{G_0}[\log(p_{G_0}/p_G)]^2 \le C_2 d_{\rho^{m_2}}(G_0, G)$. Without loss of generality, assume that $m_1 \le m_2$. We obtain that $\Pi(G \in B(\epsilon_n)) \ge \Pi(G : d_{\rho^{m_1}}(G_0, G) \le C_3\epsilon_n^2)$ for some constant $C_3$. Combining this bound with (C3) and (C4), which are applied to Lemma 3, we have: $\log \Pi(G \in B(\epsilon_n)) \gtrsim (D-1)\log(\epsilon_n/\mathrm{Diam}(\Theta)) + (2D-1)\log(1/D) + D\log\nu$, where the approximation constant is dependent on $m_1, m_2$. Note that $D \asymp [\mathrm{Diam}(\Theta)/\epsilon_n]^d$. It is simple to check that condition (9) holds, $\log \Pi(G \in B(\epsilon_n)) \ge -Cn\epsilon_n^2$, by the given rate of $\epsilon_n$, for any constant $C > 0$.

Since $\mathcal{G}_n = \mathcal{G}(\Theta)$, (8) trivially holds. Turning to condition (7), by Lemma 2(b), we have $\log N(2\epsilon_n, \mathcal{G}(\Theta), d_\rho) \le N(\epsilon_n, \Theta, \rho)\log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \le (\mathrm{Diam}(\Theta)/\epsilon_n)^d \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \le n\epsilon_n^2$ by the specified rate of $\epsilon_n$.

Step 2: For any $\mathcal{G} \subset \mathcal{G}(\Theta)$, let $R_k(\mathcal{G}, \cdot)$ be the inverse of the Hellinger information function of the $d_\rho$ metric. Specifically, for any $t \ge 0$,

$$R_k(\mathcal{G}, t) = \inf\{r \ge 0 \mid C_k(\mathcal{G}, r) \ge t\}.$$

Note that $R_k(\mathcal{G}, 0) = 0$. $R_k(\mathcal{G}, \cdot)$ is non-decreasing because $C_k(\mathcal{G}, \cdot)$ is.
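The inverse of a non-decreasing function, defined through an infimum as above, can be approximated numerically by bisection. The sketch below is illustrative only: `generalized_inverse` and the example function `r**4` (a stand-in for a Hellinger information function that grows like $c r^4$ with $c = 1$) are assumptions made here, not objects defined in the paper.

```python
def generalized_inverse(C, t, r_max, tol=1e-10):
    """Approximate inf{r in [0, r_max] : C(r) >= t} for a non-decreasing C."""
    if C(0.0) >= t:
        return 0.0
    if C(r_max) < t:
        raise ValueError("C never reaches t on [0, r_max]")
    lo, hi = 0.0, r_max  # invariant: C(lo) < t <= C(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if C(mid) >= t:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical example: C(r) = r**4, whose inverse is t**(1/4).
r_star = generalized_inverse(lambda r: r ** 4, 0.0016, r_max=1.0)
```

Monotonicity of the function is what makes bisection valid here; it is the same property used to conclude that the inverse is non-decreasing.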

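Let $(\epsilon_n)_{n \ge 1}$ be the sequence determined in the previous step of the proof. Let $M_n = R_k(\mathcal{G}(\Theta), \epsilon_n^2(C + 4))/\epsilon_n$, and $\beta_n = M_n\epsilon_n = R_k(\mathcal{G}(\Theta), \epsilon_n^2(C + 4))$. Condition (10) holds by definition of $R_k$, i.e., $C_k(\mathcal{G}(\Theta), M_n\epsilon_n) \ge \epsilon_n^2(C + 4)$. To verify (11), note that the running sum with respect to $j$ cannot have more than $\mathrm{Diam}(\Theta)/\epsilon_n$ terms, and due to the monotonicity of $C_k$, we have

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nC_k(\mathcal{G}_n, j\epsilon_n)] \le \frac{\mathrm{Diam}(\Theta)}{\epsilon_n} \exp\big(2n\epsilon_n^2 - nC_k(\mathcal{G}_n, M_n\epsilon_n)\big) \to 0.$$

Hence, Theorem 3 can be applied to conclude that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability. Under the ordinary smoothness condition (as specified in Theorem 2), $R_k(\mathcal{G}(\Theta), t) = t^{\frac{1}{4+(2\beta+1)d+\delta}}$, where $\delta$ is an arbitrary positive constant. So, $\beta_n \asymp \epsilon_n^{\frac{2}{4+(2\beta+1)d+\delta}} \asymp (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$. On the other hand, under the supersmoothness condition, $R_k(\mathcal{G}(\Theta), t) = (1/\log(1/t))^{1/\beta}$. So, $\beta_n \asymp (\log(1/\epsilon_n))^{-1/\beta} \asymp (\log n)^{-1/\beta}$.

□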
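To get a feel for how slow these rates are, the two expressions for $\beta_n$ can be evaluated directly. The sketch below uses the exponent as it appears in the proof (with $\delta$ inside the second factor) and ignores all multiplicative constants, so the numbers are orders of magnitude only; the function names and default `delta` are hypothetical.

```python
import math


def ordinary_smooth_rate(n, d, beta, delta=0.01):
    """(log n / n) ** (2 / ((d + 2) * (4 + (2*beta + 1)*d + delta))), constants ignored."""
    exponent = 2.0 / ((d + 2) * (4 + (2 * beta + 1) * d + delta))
    return (math.log(n) / n) ** exponent


def supersmooth_rate(n, beta):
    """(log n) ** (-1 / beta), constants ignored."""
    return math.log(n) ** (-1.0 / beta)


# Hypothetical comparison for d = 1 and beta = 1 at two sample sizes.
rates = {n: (ordinary_smooth_rate(n, 1, 1.0), supersmooth_rate(n, 1.0))
         for n in (10 ** 3, 10 ** 6)}
```

Both sequences decrease with $n$, but the ordinary smooth exponent shrinks as the dimension $d$ grows, and the supersmooth rate is only logarithmic.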



4.3 Finite mixture of Gaussian processes 

We now study an example in which $\Theta$ has infinite dimensions. Specifically, let $T = [0, 1]$, and $\Theta = \ell^\infty(T)$ is the Banach space of bounded functions $\theta : T \to \mathbb{R}$, equipped with the uniform norm $\|\theta\| = \sup\{|\theta(t)| : t \in T\}$. Suppose that the "true" $G_0$ has $k$ distinct atoms in $\Theta$, $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*}$, where $k$ is known.

We shall consider a "mixture of Gaussian processes" prior $\Pi$ on $\mathcal{G}_k(\Theta)$. Specifically, a random draw $G$ from $\Pi$ is a discrete measure taking the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$, where the $\theta_i$ are independent random sample paths distributed according to a zero-mean Gaussian process. Let $K : T \times T \to \mathbb{R}$ be the covariance function that defines the Gaussian process; we assume further that the Gaussian process has bounded sample paths and that it is Borel measurable.

It is a known fact that the support of the defined zero-mean Gaussian measure is equal to the closure of the reproducing kernel Hilbert space (RKHS), to be denoted by $\mathbb{H}(K)$, or simply $\mathbb{H}$, of the covariance kernel $K$ of the process (Kallianpur, 1971; van der Vaart and van Zanten, 2008a). Assume that $\theta_i^* \in \mathbb{H}$ for all $i = 1, \ldots, k$, so that the "true" $G_0$ is contained within the support of prior $\Pi$.

An ingredient of our analysis is drawn from a recent result by van der Vaart and van Zanten, who studied asymptotic behavior of priors based on Gaussian processes (van der Vaart and van Zanten, 2008b). Define $\rho(\theta_i, \theta_j) = \sup\{|\theta_i(t) - \theta_j(t)| : t \in T\}$. A key notion in their analysis is concentration functions. For each $i = 1, \ldots, k$, define the concentration function:

<Pet{e) = ^ ^inf - logPr(||0|| < e), 

where Pr denotes the probability under the Gaussian process.

Another ingredient is the extension of the notion of strong identifiability to function space $\Theta$, see Section 5.2, so that the result of Theorem 1 continues to hold for the infinite dimensional $\Theta$. Finally, we need additional assumptions:

Assumptions B.

(B1) The family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable.

(B2) For some positive constants $C_1, C_2$, $d_K(f_i, f'_j) \le C_1\rho^2(\theta_i, \theta'_j)$ and $\int f_i[\log(f_i/f'_j)]^2 \le C_2\rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$.

(B3) Under prior $\Pi$, for small $\delta > 0$, $\Pi(|p_i - p_i^*| \le \delta, i = 1, \ldots, k) \ge c_3\delta^{k\alpha}$ for some constants $c_3, \alpha > 0$.

(B4) Under prior $\Pi$, $C_4 = E\|\theta\|^2 < \infty$.

Theorem 7. Given Assumptions (B1-B4). Let $(\epsilon_n)_{n \in \mathbb{N}}$ be a sequence of positive numbers tending to 0 such that $\log n = o(n\epsilon_n^2)$ and that for any $i = 1, \ldots, k$ and any $n$,

$$\phi_{\theta_i^*}(\epsilon_n) \le n\epsilon_n^2. \tag{17}$$

Then, for a sufficiently large constant $M > 0$, $\Pi(d_\rho(G_0, G) \ge M\epsilon_n^{1/2} \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.



Remarks. (i) The reader is referred to van der Vaart and van Zanten (2008b) for examples of convergence rates $\epsilon_n$ that satisfy condition (17) for different choices of G. In particular,
if under the Gaussian process prior, the supporting atoms 9i of G ~ 11 are functions on 
T = [0, 1] with smoothness 71 > 0, while the "true" support points 9*'s of Go are functions 
with smoothness 72 > 0, then concentration functions 4>0*{e) = e~'^/^'^'^^'^^^ for each i = 

1, . . . ,k. Accordingly, the rate e„ for which Eq. (fTTl ) holds is e„ x n 271A72+1 . The 

71^72 

contraction rate for the posterior distribution of G is n 2(271 A72+1) . 

(ii) We currently do not have a concrete example of likelihood functions f{-\9) satisfying 
strong identifiability conditions. We plan to explore this issue in a future work. 



5 Proofs of main results 

5.1 Identifiability results 
Proof of Theorem 1

Proof. Suppose that Eq. ([Hi is not true, then there will be sequences of G„ and G^ tending 
to $G_0$ in $d_\rho$ metric, and that $\psi(G_n, G'_n) \to 0$. We write $G_n = \sum_{i=1}^{\infty} p_{n,i}\delta_{\theta_{n,i}}$, where $p_{n,i} = 0$ for indices $i$ greater than the number of atoms of $G_n$. Similar notation is applied to $G'_n$. Since both $G_n$ and $G'_n$ have finite number of atoms, there is $q^{(n)} \in Q(p_n, p'_n)$ so that $d_{\rho^2}(G_n, G'_n) = \sum_{ij} q_{ij}^{(n)}\rho^2(\theta_{n,i}, \theta'_{n,j})$. Note that $d_\rho^2(G_n, G'_n) \le d_{\rho^2}(G_n, G'_n) = O(d_\rho(G_n, G'_n))$, where the latter bound is due to the boundedness of $\Theta$.

LetO„ = : pHen,^,e'^^ < for some 5' G (0,1). Then, 



E(n) 



as n — > oo. Since G Q(p„,p'„), 



we can express 



sup 



sup 



2 = 1 



/dp2{Gn,G'n), 



and, by Taylor's expansion. 



sup 



E 



(i,i)6Cn 

E 

/dp2{Gn,G'J 

sup + + Cn{x) + Rn{x)\/Dn, 



+ 



where 

Rnix) = O 



E 

{«,i)eCr, 



due to Eq. Q and the definition of On- So Rn{x) / dp2{Gn,G'^) — )• 0. The quantities 
An{x),Bn{x) and Cn(x) are linear functional of f{x\9), Df{x\d) and D'^f{x\0) for dif- 
ferent 6''s, respectively. Since is compact, subsequences of Gn and G'^ can be chosen so 
that each of their support points converges to a fixed atom O^, for I = 1, . . . , k* < k. 

After being properly rescaled, the limits of An{x),Bn{x) and Gn{x) are still linear 
functionals with constant coefficients not depending on x. In particular, Cn{x)/Dn — 

some 7j, 's and not all these coefficients vanishing, since Ylij=i 
1. The coefficients in An{x)/Dn and Bn{x)/Dn can go either to infinity or to a con- 
stant by further selecting the subsequences of G„ and G'^- If they go to infinity, a se- 
quence dn = 0{1) can be found such that dnAn {x)/Dn converges to Yl^j=i '^j f{A^*j) ^^'^ 
dnBn{x) / Dn convcrgcs to PjDf{x\6*) for some finite aj and Thus, we have 
$d_n$ and $\alpha_j, \beta_j, \gamma_j$, not all being zero, such that

k* 

i=i 

(18) 

for all $x$. This entails that the right side of the preceding display must be 0 for almost all $x$. By strong identifiability, all coefficients must be 0, which leads to contradiction.

With respect to $\psi_1(G, G')$, suppose that the claim is not true, which implies the existence of a subsequence $G_n, G'_n$ such that $\psi_1(G_n, G'_n) \to 0$. Going through the same argument as above, we have $\alpha_j, \beta_j, \gamma_j$, not all of which are zero, such that Eq. (18) holds. An application of Fatou's lemma yields $\int \big|\sum_{j=1}^{k^*} \alpha_j f(x|\theta_j^*) + \beta_j^T Df(x|\theta_j^*) + \gamma_j^T D^2 f(x|\theta_j^*)\gamma_j\big|\,dx = 0$. Thus the integrand must be 0 for almost all $x$, leading to contradiction. □

Proof of Theorem 2

Proof. To obtain an upper bound of $d_{\rho^2}(G, G')$ in terms of $d_V(p_G, p_{G'})$ under the condition that $d_V(p_G, p_{G'}) \to 0$, our strategy is to approximate $G$ and $G'$ by convolving these with some mollifier $K_\delta$. By the triangle inequality, $d_{\rho^2}(G, G')$ can be bounded in terms of $d_{\rho^2}(G, G * K_\delta)$, $d_{\rho^2}(G', G' * K_\delta)$, and $d_{\rho^2}(G * K_\delta, G' * K_\delta)$. The first two terms are simple to bound, while the last term can be handled by expressing $G * K_\delta$ as the convolution of the mixture density $p_G$ with another function. This trick was widely exploited in kernel density estimation methods for deconvolution problems (e.g., Zhang (1990); Fan (1991)). We also need the following elementary lemma.

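Lemma 4. Assume that $p$ and $p'$ are two probability density functions on $\mathbb{R}^d$ with bounded $s$-moments.

(a) For $t$ such that $0 < t < s$,

$$\int |p(x) - p'(x)|\,\|x\|^t\,dx \le 2\|p - p'\|_{L_1}^{(s-t)/s}\,\big(E_p\|X\|^s + E_{p'}\|X\|^s\big)^{t/s}.$$

(b) Let $V_d = \pi^{d/2}/\Gamma(d/2 + 1)$ denote the volume of the $d$-dimensional unit sphere. Then,

$$\|p - p'\|_{L_1} \le 2V_d^{s/(d+2s)}\,\big(E_p\|X\|^s + E_{p'}\|X\|^s\big)^{d/(d+2s)}\,\|p - p'\|_{L_2}^{2s/(d+2s)}.$$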

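Take any $s > 0$, and let $K : \mathbb{R}^d \to (0, \infty)$ be a symmetric density function on $\mathbb{R}^d$ whose Fourier transform $\tilde{K}$ is a continuous function whose support is bounded in $[-1, 1]^d$. Moreover, $K$ has bounded moments up to order $s$. Consider mollifiers $K_\delta(x) = \delta^{-d}K(x/\delta)$ for $\delta > 0$. Let $\tilde{K}_\delta$ and $\tilde{f}$ be the Fourier transforms for $K_\delta$ and $f$, respectively. Define $g_\delta$ to be the inverse Fourier transform of $\tilde{K}_\delta/\tilde{f}$.

Note that the function $\tilde{K}_\delta(\omega)/\tilde{f}(\omega)$ has bounded support. So $g_\delta$ is well defined, and $\tilde{g}_\delta := \tilde{K}_\delta(\omega)/\tilde{f}(\omega)$ is the Fourier transform of $g_\delta$. By the convolution theorem, $f * g_\delta = K_\delta$. As a result,

$$G * K_\delta = G * f * g_\delta = p_G * g_\delta.$$

Then the second moment under $K_\delta$ is $O(\delta^2)$. It entails that $d_{\rho^2}(G, G * K_\delta) = O(\delta^2)$. By the triangle inequality, $d_{\rho^2}^{1/2}(G, G') \le d_{\rho^2}^{1/2}(G, G * K_\delta) + d_{\rho^2}^{1/2}(G', G' * K_\delta) + d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) \le d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) + O(\delta)$, so for some constant $C(K) > 0$ dependent only on kernel $K$,

$$d_{\rho^2}(G, G') \le 2d_{\rho^2}(G * K_\delta, G' * K_\delta) + C(K)\delta^2. \tag{19}$$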



Theorem 6.15 of Villani (2008) provides an upper bound for the Wasserstein distance: for any two probability measures $\mu$ and $\nu$, $d_{\rho^2}(\mu, \nu) \le 2\int \|x\|^2\,d|\mu - \nu|(x)$, where $|\mu - \nu|$ is the total variation of measure $\mu - \nu$. Thus,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le 2\int \|x\|^2\,|G * K_\delta(x) - G' * K_\delta(x)|\,dx. \tag{20}$$



We note that since density function $K$ has bounded $s$-th moment, $\int \|x\|^s\,G * K_\delta(dx) \le 2^s[\int \|\theta\|^s dG(\theta) + \int \|x\|^s K_\delta(x)dx] = 2^s[\int \|\theta\|^s dG(\theta) + \delta^s\int \|x\|^s K(x)dx] < \infty$, because $G$'s support points lie in a compact set. Applying Lemma 4 to Eq. (20), we obtain that for $\delta < 1$,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le C(d, K, s)\,\|G * K_\delta - G' * K_\delta\|_{L_1}^{(s-2)/s} \le C(d, K, s)\,\|G * K_\delta - G' * K_\delta\|_{L_2}^{2(s-2)/(d+2s)}. \tag{21}$$

Here, constants $C(d, K, s)$ are different in each line, and they are dependent only on $d$, $s$ and the $s$-th moment of density function $K$.

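Next, we use a known fact that for an arbitrary (signed) measure $\mu$ on $\mathbb{R}^d$ and function $g \in L_2(\mathbb{R}^d)$, there holds $\|\mu * g\|_{L_2} \le |\mu|\,\|g\|_{L_2}$, where $|\mu|$ denotes the total variation of $\mu$:

$$\|G * K_\delta - G' * K_\delta\|_{L_2} = \|p_G * g_\delta - p_{G'} * g_\delta\|_{L_2} = \|(p_G - p_{G'}) * g_\delta\|_{L_2} \le 2d_V(p_G, p_{G'})\,\|g_\delta\|_{L_2}. \tag{22}$$

By Plancherel's identity,

$$\|g_\delta\|_{L_2}^2 = \int \frac{\tilde{K}_\delta(\omega)^2}{\tilde{f}(\omega)^2}\,d\omega = \int_{\mathbb{R}^d} \frac{\tilde{K}(\omega\delta)^2}{\tilde{f}(\omega)^2}\,d\omega \le C\int_{[-1/\delta, 1/\delta]^d} \tilde{f}(\omega)^{-2}\,d\omega.$$

The last bound holds because $\tilde{K}$ has support in $[-1, 1]^d$, and is bounded by a constant. Collecting Eqs. (19), (20), (21), (22) and the preceding display, we have:

$$d_{\rho^2}(G, G') \le C(d, K, s)\inf_{\delta \in (0,1)}\Big\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}\Big[\int_{[-1/\delta, 1/\delta]^d} \tilde{f}(\omega)^{-2}\,d\omega\Big]^{\frac{s-2}{d+2s}}\Big\}.$$

If $|\tilde{f}(\omega)\prod_{j=1}^{d}|\omega_j|^\beta| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constant $d_0$, then

$$d_{\rho^2}(G, G') \le C(d, K, s, \beta)\inf_{\delta \in (0,1)}\Big\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}(1/\delta)^{\frac{(2\beta+1)d(s-2)}{d+2s}}\Big\} \le C(d, K, s, \beta)\,d_V(p_G, p_{G'})^{\frac{4(s-2)}{2(d+2s)+(2\beta+1)d(s-2)}}.$$

The exponent tends to $4/(4 + (2\beta+1)d)$ as $s \to \infty$, so we obtain that $d_{\rho^2}(G, G') \le C(d, \beta, r)\,d_V(p_G, p_{G'})^r$, for any constant $r < 4/(4 + (2\beta+1)d)$, as $d_V(p_G, p_{G'}) \to 0$.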

If $|\tilde{f}(\omega)\prod_{j=1}^{d}\exp(|\omega_j|^\beta)| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constants $\beta, d_0$, then

d2(G,G') < C(d,i^,s,/3) j inf +dy(pG,PG0'^'~'^/^'+''^^exp-2dr^-^|. 

t <5e(o,i) a + 2s J 

Taking J"'' = log (iy(pG,PG')' we obtain that dp2 (G, G') < C{d, l3){- log dv{pG,PG')) 

□ 



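Proof of Lemma 1

Proof. We exploit the variational characterization of $f$-divergences (Nguyen et al., 2010):

$$d_\phi(f_i, f'_j) = \sup_{\varphi_{ij}} \int \big(\varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\big)\,d\mu.$$

Here, the supremum is taken over all measurable functions on $\mathcal{X}$. $\phi^*$ denotes the Legendre-Fenchel conjugate dual of convex function $\phi$. ($\phi^*$ is again a convex function on $\mathbb{R}$ and is defined by $\phi^*(v) = \sup_{u \in \mathbb{R}}(uv - \phi(u))$.) Thus, $d_{\rho_\phi}(G, G') = \inf_{q \in Q(p, p')} \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int (\varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i)$. On the other hand, for any $q \in Q(p, p')$,

$$d_\phi(p_G, p_{G'}) = \sup_{\varphi} \int \varphi\,p_{G'} - \phi^*(\varphi)\,p_G = \sup_{\varphi} \int \varphi\sum_j p'_j f'_j - \phi^*(\varphi)\sum_i p_i f_i$$
$$= \sup_{\varphi} \int \sum_{ij} q_{ij}\big(\varphi f'_j - \phi^*(\varphi) f_i\big) \le \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \big(\varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\big),$$

where the last inequality holds because the supremum is taken over a larger set of functions. Moreover, the bound holds for any $q \in Q(p, p')$, so $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$. □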
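The inequality can be checked numerically in the simplest setting. The sketch below uses the total variation distance on a three-point sample space and the independent coupling $q_{ij} = p_i p'_j$, which is one admissible element of $Q(p, p')$ rather than the minimizing one; all component densities and weights are hypothetical numbers chosen for illustration.

```python
def total_variation(f, g):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))


def mixture(weights, components):
    """Mixture density sum_i weights[i] * components[i] on a finite sample space."""
    size = len(components[0])
    return [sum(w * c[x] for w, c in zip(weights, components)) for x in range(size)]


def coupling_cost(q, components, components_prime):
    """sum_ij q[i][j] * d_V(f_i, f'_j) for a coupling q of the two weight vectors."""
    return sum(q[i][j] * total_variation(fi, fj)
               for i, fi in enumerate(components)
               for j, fj in enumerate(components_prime))


def independent_coupling(p, p_prime):
    """The product coupling q_ij = p_i * p'_j, whose marginals are p and p'."""
    return [[pi * pj for pj in p_prime] for pi in p]


# Hypothetical two-component mixtures on the sample space {0, 1, 2}.
f = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
f_prime = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
p, p_prime = [0.4, 0.6], [0.5, 0.5]

lhs = total_variation(mixture(p, f), mixture(p_prime, f_prime))
rhs = coupling_cost(independent_coupling(p, p_prime), f, f_prime)
```

Since the bound holds for every coupling, `lhs` is no larger than `rhs` here; minimizing the right-hand side over couplings would give the Wasserstein-type quantity on the right of the lemma.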



Proof of Proposition 1

Proof. (a) Suppose that the claim is not true, there is a sequence of $(G_0, G) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$ such that $d_\rho(G_0, G) \ge r/2 > 0$ always holds, and that converges in $d_\rho$ metric to $G_0^* \in \mathcal{G}_k$ and $G^* \in \mathcal{G}$, respectively. This is due to the compactness of both $\mathcal{G}_k(\Theta)$ and $\mathcal{G}$. We must have $d_\rho(G_0^*, G^*) \ge r/2 > 0$, so $G_0^* \ne G^*$. At the same time, $d_h(p_{G_0^*}, p_{G^*}) = 0$, which implies that $p_{G_0^*} = p_{G^*}$ for almost all $x \in \mathcal{X}$. By the finite identifiability condition, $G_0^* = G^*$, which is a contradiction.

(b) is an immediate consequence of Theorem [T] by noting that under the given hy- 
pothesis, there is c{k) > depending on k, such that d1{pGo,PG) > '^v^PGo^Pg)/'^ > 
c{k)(p2 (Co, G) > c{k)dp{Go, G) for sufficiently small dp{GQ, G). The boundedness of 
implies the boundedness of dp{Go, G), thereby extending the claim for the entire admissible 
range of dp{Go, G). (c) is obtained in a similar way from Theorem[2] □ 



5.2 Strong identifiability conditions in normed spaces 

Let $(\Theta, \|\cdot\|)$ be a real normed space. A continuous function $f : \Theta \to \mathbb{R}$ is twice Fréchet differentiable at a point $\theta^* \in \Theta$ if it is Fréchet differentiable at $\theta^*$ with Fréchet derivative $D_\theta f(\theta^*)$ (which is a bounded linear function from $\Theta$ into $\mathbb{R}$), and there is a continuous bilinear function $D_\theta^2 f(\theta^*; \cdot, \cdot)$ from $\Theta \times \Theta$ into $\mathbb{R}$, called the second Fréchet differential of $f$, which has the property that

$$\lim_{\|\gamma\| \to 0} \big|f(\theta^* + \gamma) - f(\theta^*) - D_\theta f(\theta^*)(\gamma) - \tfrac{1}{2}D_\theta^2 f(\theta^*; \gamma, \gamma)\big| / \|\gamma\|^2 = 0.$$

We say that the family of density functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice Fréchet differentiable in $\theta$ (with the Fréchet derivative $D_\theta f(x|\theta)(\cdot)$, and the second Fréchet differential $D_\theta^2 f(x|\theta; \cdot, \cdot)$), and for any finite $k$ and $k$ different $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Big|\sum_{i=1}^{k} \alpha_i f(x|\theta_i) + D_\theta f(x|\theta_i)(\beta_i) + D_\theta^2 f(x|\theta_i; \gamma_i, \gamma_i)\Big| = 0 \tag{23}$$

implies that $\alpha_i = 0$, $\beta_i = \gamma_i = 0$ for $i = 1, \ldots, k$.

Note that if for each $x \in \mathcal{X}$, $f(x|\cdot)$ is twice Fréchet continuously differentiable, i.e., $D_\theta^2 f(x|\cdot; \cdot, \cdot) : \Theta \times \Theta \times \Theta \to \mathbb{R}$ is a continuous function, then $f$ admits the following Taylor expansion (cf. pg. 659, Polak (1997)):

$$f(x|\theta_2) - f(x|\theta_1) = D_\theta f(x|\theta_1)(\theta_2 - \theta_1) + \tfrac{1}{2}D_\theta^2 f\big(x|\theta_1 + s(\theta_2 - \theta_1); \theta_2 - \theta_1, \theta_2 - \theta_1\big),$$

for some $s \in [0, 1]$. Assume further that the second Fréchet differential of $f(x|\cdot)$, $D_\theta^2 f(x|\cdot; \cdot, \cdot)$, satisfies a uniform Lipschitz condition:

$$|D_\theta^2 f(x|\theta_1; \gamma, \gamma) - D_\theta^2 f(x|\theta_2; \gamma, \gamma)| \le C\|\theta_1 - \theta_2\|^\delta\,\|\gamma\|^2,$$

for all $x$, $\theta_1, \theta_2 \in \Theta$, and some fixed $C$ and $\delta > 0$. It is simple to observe that Theorem 1 and its proof extend line-by-line to the normed space setting.

5.3 Proof of Theorem 7

Proof. In the following, $B_1$ denotes the unit ball of Banach space $\Theta = \ell^\infty([0, 1])$, while $\mathbb{H}_1$ the unit ball of the RKHS $\mathbb{H}$. For a given constant $C_0 > 1$ with $e^{-C_0 n\epsilon_n^2} < 1/2$, define a sequence of measurable sets $(B_n)_{n \ge 1}$, $B_n = \epsilon_n B_1 + C_n\mathbb{H}_1$, where $C_n = -2\Phi^{-1}(e^{-C_0 n\epsilon_n^2})$. By Theorem 2.1 of van der Vaart and van Zanten (2008a), the sequence of sets $B_n$ admits the following useful properties:

$$\log N(3\epsilon_n, B_n, \rho) \le 6C_0 n\epsilon_n^2, \tag{24}$$
$$\Pr(\theta \notin B_n) \le e^{-C_0 n\epsilon_n^2}, \tag{25}$$
$$\Pr(\rho(\theta, \theta_i^*) < 2\epsilon_n) \ge e^{-n\epsilon_n^2} \quad \text{for each } i = 1, \ldots, k. \tag{26}$$

The proof proceeds by verifying that all conditions in Theorem 3 hold for some constant $C > 0$. Define the following sequence of subsets $\mathcal{G}_n := \mathcal{G}_k(B_n) \subset \mathcal{G}_k(\Theta)$. By Assumptions (B1) and (B4), there is $c > 0$ such that $C_k(\mathcal{G}_n, r) \ge cr^4$ for sufficiently small $r > 0$.

Let C4 = £116*11^ < oo, then for any /i G H, ||/i|| < C4||/i||h (cf. van der Vaart and van ZantenI 



(l2008ah pg. 203). Thus, D i am(^„ ) < 2(e„ + QC^) < 2{en + C^enVWCon) (cf. 



van der Vaart and van ZantenI (2008b) pg. 1454, for the second equality). By Lemmata), 



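$\log N(2\epsilon_n, \mathcal{G}_n, d_\rho) \le k\big(\log N(\epsilon_n, B_n, \rho) + \log(e + e\,\mathrm{Diam}(B_n)/\epsilon_n)\big)$. Combined with (24), we have $\log D(4\epsilon_n, \mathcal{G}_n, d_\rho) \le \log N(2\epsilon_n, \mathcal{G}_n, d_\rho) \le O(kn\epsilon_n^2)$, due to (B4) and the assumption that $\log n = o(n\epsilon_n^2)$. If we replace $\epsilon_n$ by a sufficiently large multiple of $\epsilon_n$, we shall obtain the bound (7) precisely.

Turning to (8), by the union bound, $\Pr(G \notin \mathcal{G}_n) \le \sum_{i=1}^{k} \Pr(\theta_i \notin B_n) \le ke^{-C_0 n\epsilon_n^2} \le \exp(-n\epsilon_n^2(C + 4))$, by choosing constant $C_0$ sufficiently large (after $C$ is fixed).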

Next, we consider condition Suppose that G = X^^LiPji^e. where p{9i,9*) < e„ 
for all i = 1, . . . , A;, and \pi — p*\ < e^/ {kDiam{Bn)'^). Combining Lemma[T]with As- 
sumption (B2) on the likelihood functions, we obtain that dxiPGoiPG) < dpK{Go, G) < 
Ci Yji<i,j<k QijP'^i^h for any Q G Q. It is simple to check that infg Ei<i,j<fc QijP^i^h ^3) ^ 
Eti(Pr ^P^)p\Ol Bi) + Eti \P^ - P*|Diam(i?„)2 < 2el Hence, 

U{dKiPGo,PG) < 2el) > Uipi9i,9*) < e„; \pi - p*\ < e2/(A;Diam(S„)2), i = 1, . . . , A:) 

> eM-knel/4)cs{el/{kDmm{Bnf))''^ 

-ka 



> C3 exp(-A;ne^/4) ( 4/c(l + C^VWConf 

> exp(-ne^C). 



(The second inequality is due to Assumption (B3) and (I26b . the fourth inequality is due 
to Assumption (B4)). In view of Assumption (B2), we obtain that condition Q holds by 
choosing a sufficiently large constant C. 

Finally, we shall choose $M_n$ such that $M_n\epsilon_n \to 0$, and $C_k(\mathcal{G}_n, M_n\epsilon_n) \ge c(M_n\epsilon_n)^4 \ge \epsilon_n^2(C + 4)$. This is possible by taking $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$. As a result, $M_n\epsilon_n \asymp \epsilon_n^{1/2}$. We conclude by invoking Theorem 3. □

5.4 Other auxiliary results

Proof of Lemma 2

Proof. (a) Suppose that $(\eta_1, \ldots, \eta_T)$ forms an $\epsilon$-covering for $\Theta$ under metric $\rho$, where $T = N(\epsilon, \Theta, \rho)$ denotes the (minimum) covering number. Take any discrete measure $G(p, \theta) = \sum_{i=1}^{k} p_i\delta_{\theta_i}$. For each $\theta_i$ there is an approximating $\theta'_i$ among the $\eta_j$'s such that $\rho(\theta_i, \theta'_i) < \epsilon$. Let $p' = (p'_1, \ldots, p'_k)$ be a $k$-dim vector in the probability simplex that deviates from $p$ by less than $\delta$ in $\ell_1$ distance: $\|p' - p\|_1 \le \delta$. Define $G' = \sum_{i=1}^{k} p'_i\delta_{\theta'_i}$. Then $d_\rho(G, G') \le \sum_{i=1}^{k}(p_i \wedge p'_i)\rho(\theta_i, \theta'_i) + \|p - p'\|_1\mathrm{Diam}(\Theta) \le \epsilon + \delta\,\mathrm{Diam}(\Theta)$. (This is easily seen by moving $p_i \wedge p'_i$ mass of $\theta_i$ to the "nearby" $\theta'_i$, while the remaining mass is moved in an arbitrary way.) It follows that an $(\epsilon + \delta\,\mathrm{Diam}(\Theta))$-covering for $\mathcal{G}_k(\Theta)$ can be constructed by combining each element of a $\delta$-covering in $\ell_1$ metric of the $(k-1)$-probability simplex and $k$ $\epsilon$-coverings of $\Theta$.

The covering number of the $(k-1)$-probability simplex is less than the number of cubes of length $\delta/k$ covering $[0, 1]^k$ times the volume of $\{(p'_1, \ldots, p'_k) : p'_j \ge 0, \sum_j p'_j \le 1 + \delta\}$, i.e., $(k/\delta)^k(1 + \delta)^k/k! \sim (1 + 1/\delta)^k e^k/\sqrt{2\pi k}$. It follows that $N(\epsilon + \delta\,\mathrm{Diam}(\Theta), \mathcal{G}_k(\Theta), d_\rho) \le T^k(1 + 1/\delta)^k e^k/\sqrt{2\pi k}$. Take $\delta = \epsilon/\mathrm{Diam}(\Theta)$ to achieve the claim.
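The transport bound used in part (a) can be illustrated for discrete measures on the real line, where the $L_1$ Wasserstein distance equals the integral of the absolute difference of the two distribution functions. The sketch below is a one-dimensional illustration only, with $\rho(x, y) = |x - y|$; the atoms, weights and function names are hypothetical.

```python
def wasserstein_1d(atoms_a, weights_a, atoms_b, weights_b):
    """L1 Wasserstein distance between two discrete measures on the real line,
    computed as the integral of |F_a - F_b| between consecutive atoms."""
    mass_a, mass_b = {}, {}
    for x, w in zip(atoms_a, weights_a):
        mass_a[x] = mass_a.get(x, 0.0) + w
    for x, w in zip(atoms_b, weights_b):
        mass_b[x] = mass_b.get(x, 0.0) + w
    points = sorted(set(mass_a) | set(mass_b))
    cdf_a = cdf_b = total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a += mass_a.get(left, 0.0)
        cdf_b += mass_b.get(left, 0.0)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def transport_upper_bound(atoms, weights, atoms_prime, weights_prime, diam):
    """sum_i min(p_i, p'_i) * |theta_i - theta'_i| + ||p - p'||_1 * diam."""
    matched = sum(min(p, q) * abs(x - y)
                  for x, y, p, q in zip(atoms, atoms_prime, weights, weights_prime))
    l1 = sum(abs(p - q) for p, q in zip(weights, weights_prime))
    return matched + l1 * diam


# Hypothetical measures on Theta = [0, 1], so Diam(Theta) = 1.
theta, p = [0.0, 0.5, 1.0], [0.2, 0.3, 0.5]
theta_prime, p_prime = [0.05, 0.45, 1.0], [0.25, 0.25, 0.5]
distance = wasserstein_1d(theta, p, theta_prime, p_prime)
bound = transport_upper_bound(theta, p, theta_prime, p_prime, diam=1.0)
```

The bound is loose by design: it moves all unmatched mass across the full diameter, which is all the covering argument needs.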

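(b) Suppose that $(\eta_1, \ldots, \eta_T)$ forms an $\epsilon$-covering for $\Theta$ under metric $\rho$, and $T = N(\epsilon, \Theta, \rho)$. Take any discrete measure $G(p, \theta) = \sum_{i=1}^{k} p_i\delta_{\theta_i} \in \mathcal{G}(\Theta)$, where $k$ may be infinity. The collection of atoms $\theta_1, \ldots, \theta_k$ can be subdivided into disjoint subsets $S_1, \ldots, S_T$, some of which may be empty, so that for each $t = 1, \ldots, T$, $\rho(\theta_i, \eta_t) \le \epsilon$ for any $\theta_i \in S_t$. Thus, if we define $p'_t = \sum_{i=1}^{k} p_i I(\theta_i \in S_t)$, and discrete measure $G'(p', \eta) = \sum_{t=1}^{T} p'_t\delta_{\eta_t}$, then we are guaranteed that $d_\rho(G, G') \le \sum_{t=1}^{T}\sum_{i=1}^{k} p_i I(\theta_i \in S_t)\rho(\theta_i, \eta_t) \le \epsilon$.

Let $p'' = (p''_1, \ldots, p''_T)$ be a $T$-dim vector in the probability simplex that deviates from $p'$ by less than $\delta$ in $\ell_1$ distance: $\|p'' - p'\|_1 \le \delta$. Take $G'' = \sum_{t=1}^{T} p''_t\delta_{\eta_t}$. It is simple to observe that $d_\rho(G', G'') \le \mathrm{Diam}(\Theta)\delta$. By the triangle inequality, $d_\rho(G, G'') \le d_\rho(G, G') + d_\rho(G', G'') \le \epsilon + \delta\,\mathrm{Diam}(\Theta)$.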

The foregoing arguments establish that (e + (5Diam(0))-covering in the Wasserstein 
metric for the subset Qs ^ ^(0) can be constructed by combining each element of the 
5-covering in li of the T — 1 simplex and a single covering of 0. From the proof of part 
(a), A^(e + (5Diam(0), Gs, dp) < (1 + l/(5)^e^/\/2^. Take 5 = e/Diam(0) to conclude. 

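(c) Consider a $G = \sum_{i=1}^{k} p_i\delta_{\theta_i}$ such that $d_\rho(G_0, G) \le 2\epsilon$. By definition, there is $q \in Q(p, p^*)$ so that $\sum_{ij} q_{ij}\rho(\theta_i^*, \theta_j) \le 2\epsilon$. Since $\sum_j q_{ij} = p_i^*$, this implies that $2\epsilon \ge \sum_{i=1}^{k} p_i^* \min_j \rho(\theta_i^*, \theta_j)$. Thus, for each $i = 1, \ldots, k$ there is a $j$ such that $\rho(\theta_i^*, \theta_j) \le 2\epsilon/p_i^* \le 2M\epsilon$. Without loss of generality, assume that $\rho(\theta_i^*, \theta_i) \le 2M\epsilon$ for all $i = 1, \ldots, k$. For sufficiently small $\epsilon$, for any $i$, it is simple to observe that $d_\rho(G_0, G) \ge |p_i^* - p_i| \min_{j \ne i}\rho(\theta_i^*, \theta_j) \ge |p_i^* - p_i| \min_{j \ne i}\rho(\theta_i^*, \theta_j^*)/2$. Thus, $|p_i^* - p_i| \le 4\epsilon/m$.

Thus, an $(\epsilon/4 + \delta\,\mathrm{Diam}(\Theta))$-covering in $d_\rho$ for $\{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}$ can be constructed by combining the $\epsilon/4$-covering for each of the $k$ sets $\{\theta \in \Theta : \rho(\theta, \theta_i^*) \le 2M\epsilon\}$ and the $\delta/k$-covering for each of the $k$ sets $[p_i^* - 4\epsilon/m, p_i^* + 4\epsilon/m]$. This entails that: $N(\epsilon/4 + \delta\,\mathrm{Diam}(\Theta), \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le [\sup_{\Theta'} N(\epsilon/4, \Theta', \rho)]^k (8\epsilon k/m\delta)^k$. Take $\delta = \epsilon/(4\,\mathrm{Diam}(\Theta))$ to conclude the proof.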



□ 



Proof of Lemma 4

Proof. (a) For an arbitrary constant $R > 0$, we have $\int |p(x) - p'(x)|\,\|x\|^t dx \le \int_{\|x\| \le R} |p - p'|\,\|x\|^t + \int_{\|x\| \ge R}(p + p')\|x\|^t \le R^t\|p - p'\|_{L_1} + R^{t-s}(E_p\|X\|^s + E_{p'}\|X\|^s)$. Choose $R = [(E_p\|X\|^s + E_{p'}\|X\|^s)/\|p - p'\|_{L_1}]^{1/s}$ to conclude.

(b) For any $R > 0$, we have $\int_{\|x\| \le R} |p(x) - p'(x)|dx \le V_d^{1/2}R^{d/2}[\int_{\|x\| \le R}(p(x) - p'(x))^2dx]^{1/2} \le V_d^{1/2}R^{d/2}\|p - p'\|_{L_2}$. We also have $\int_{\|x\| \ge R} |p(x) - p'(x)|dx \le \int_{\|x\| \ge R} p(x) + p'(x)\,dx \le R^{-s}(E_p\|X\|^s + E_{p'}\|X\|^s)$. Thus, $\|p - p'\|_{L_1} \le \inf_{R > 0} V_d^{1/2}R^{d/2}\|p - p'\|_{L_2} + R^{-s}(E_p\|X\|^s + E_{p'}\|X\|^s)$, which gives the desired bound. □
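The interpolation inequality in part (a) can be sanity-checked on a discrete analogue, with sums over a finite grid in place of integrals; the truncation argument in the proof goes through unchanged in that setting. The grid, the two probability vectors and the function name below are hypothetical choices for illustration.

```python
def moment_interpolation(xs, p, q, s, t):
    """Return both sides of the discrete analogue of Lemma 4(a):
    sum |p - q| |x|^t  <=  2 * ||p - q||_1^((s-t)/s) * (E_p|X|^s + E_q|X|^s)^(t/s)."""
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    lhs = sum(abs(a - b) * abs(x) ** t for x, a, b in zip(xs, p, q))
    moments = sum((a + b) * abs(x) ** s for x, a, b in zip(xs, p, q))
    rhs = 2.0 * l1 ** ((s - t) / s) * moments ** (t / s)
    return lhs, rhs


# Hypothetical probability vectors on the grid {-2, ..., 3}, with s = 4 and t = 2.
xs = [-2, -1, 0, 1, 2, 3]
p = [0.10, 0.20, 0.40, 0.20, 0.10, 0.00]
q = [0.05, 0.15, 0.30, 0.25, 0.15, 0.10]
lhs, rhs = moment_interpolation(xs, p, q, s=4, t=2)
```

The case $t = 2$ is the one used in the proof of Theorem 2, where the weighted $L_1$ difference in Eq. (20) is traded for a power of the plain $L_1$ distance.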



6 Appendix

We outline in this section the proofs of Theorems 3 and 4 for completeness. Our proof follows the same steps as in Ghosal et al. (2000), with suitable modification for the inclusion of the Hellinger information function, which plays important roles in the specification of conditions for convergence and determining convergence rates. The proof consists of results on the existence of tests, which are then turned into probability bounds on the posterior contraction.

A test $\varphi_n$ is a measurable indicator function of the iid sample $X_1, \ldots, X_n$. For each pair of discrete measures $G_0, G_1$ we consider tests for discriminating $G_0 \in \mathcal{G}(\Theta)$ against a closed ball $B(G_1, d_\rho(G_0, G_1)/2) = \{G \in \mathcal{G}(\Theta) : d_\rho(G_1, G) \le d_\rho(G_1, G_0)/2\}$. In the following $P_G$ denotes the expectation under the mixture distribution given by density $p_G$.

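Lemma 5. For some fixed $k < \infty$, suppose that $\mathcal{G}_k(\Theta) \subset \mathcal{G} \subset \mathcal{G}(\Theta)$. Then, for every pair of discrete measures $(G_0, G_1) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$ there exist tests $\{\varphi_n\}$ that have the following properties:

$$P_{G_0}\varphi_n \le \exp[-nC_k(\mathcal{G}, d_\rho(G_0, G_1))], \tag{27}$$
$$\sup_{G \in \mathcal{G}(\Theta) : d_\rho(G, G_1) \le d_\rho(G_0, G_1)/2} P_G(1 - \varphi_n) \le \exp[-nC_k(\mathcal{G}, d_\rho(G_0, G_1))]. \tag{28}$$

Next, existence of tests can be shown for discriminating $G_0$ against the complement of a closed ball:

Lemma 6. Let $\mathcal{G} \subset \mathcal{G}(\Theta)$ and $G_0 \in \mathcal{G}_k(\Theta) \subset \mathcal{G}$ for some $k < \infty$. Suppose that for some non-increasing function $D(\epsilon)$, some $\epsilon_n \ge 0$ and every $\epsilon > \epsilon_n$,

$$D(\epsilon/2, \{G \in \mathcal{G} : \epsilon \le d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le D(\epsilon). \tag{29}$$

Then, for every $\epsilon > \epsilon_n$ there exist tests $\varphi_n$ (depending on $\epsilon > 0$) such that for any $t \in \mathbb{N}$

$$P_{G_0}\varphi_n \le D(\epsilon) \sum_{t=1}^{[\mathrm{Diam}(\Theta)/\epsilon]} \exp[-nC_k(\mathcal{G}, t\epsilon)], \tag{30}$$
$$\sup_{G \in \mathcal{G} : d_\rho(G_0, G) > t\epsilon} P_G(1 - \varphi_n) \le \exp[-nC_k(\mathcal{G}, t\epsilon)]. \tag{31}$$

Proof of Theorems 3 and 4. By a result of Ghosal et al. (2000) (Lemma 8.1, pg. 524), for every $\epsilon > 0$ and probability measure $\Pi$ on the set $B(\epsilon)$ defined by Eq. (6), we have, for every $C > 0$,

$$P_{G_0}\Big(\int \prod_{i=1}^{n} \frac{p_G(X_i)}{p_{G_0}(X_i)}\,d\Pi(G) \le \exp(-(1 + C)n\epsilon^2)\Big) \le \frac{1}{C^2 n\epsilon^2}.$$

This entails that, for a fixed $C \ge 1$, there is an event $A_n$ with $P_{G_0}$-probability at least $1 - (Cn\epsilon_n^2)^{-1}$, for which there holds:

$$\int \prod_{i=1}^{n} p_G(X_i)/p_{G_0}(X_i)\,d\Pi(G) \ge \exp(-2n\epsilon_n^2)\,\Pi(B(\epsilon_n)). \tag{32}$$

Let $O_n = \{G \in \mathcal{G}(\Theta) : d_\rho(G_0, G) \ge M_n\epsilon_n\}$, $S_{n,j} = \{G \in \mathcal{G}_n : d_\rho(G_0, G) \in [j\epsilon_n, (j + 1)\epsilon_n)\}$ for each $j \ge 1$. The conditions specified by Lemma 6 are satisfied by setting $D(\epsilon) = \exp(n\epsilon_n^2)$ (constant in $\epsilon$). Thus there exist tests $\varphi_n$ for which Eq. (30) and (31) hold. Then,

$$P_{G_0}\Pi(G \in O_n \mid X_1, \ldots, X_n) = P_{G_0}[\varphi_n\Pi(G \in O_n \mid X_1, \ldots, X_n)] + P_{G_0}[(1 - \varphi_n)\Pi(G \in O_n \mid X_1, \ldots, X_n)]$$
$$\le P_{G_0}[\varphi_n\Pi(G \in O_n \mid X_1, \ldots, X_n)] + P_{G_0}I(A_n^c) + P_{G_0}[(1 - \varphi_n)\Pi(G \in O_n \mid X_1, \ldots, X_n)I(A_n)].$$

Exploiting Lemma 6, all terms in the preceding display can be shown to vanish as $n \to \infty$. The proof for Theorem 3 proceeds in a similar way to Theorem 2.1 of Ghosal et al. (2000), while the proof for Theorem 4 is similar to their Theorem 2.4.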